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Abstract 

A general methodology is introduced for the construction and effective application of 
control variates to estimation problems involving data from reversible MCMC samplers. 
We propose the use of a specific class of functions as control variates, and we introduce a 
new, consistent estimator for the values of the coefficients of the optimal linear combination 
of these functions. The form and proposed construction of the control variates is derived 
from our solution of the Poisson equation associated with a specific MCMC scenario. The 
new estimator, which can be applied to the same MCMC sample, is derived from a novel, 
finite-dimensional, explicit representation for the optimal coefficients. The resulting variance- 
reduction methodology is primarily applicable when the simulated data are generated by a 
conjugate random-scan Gibbs sampler. MCMC examples of Bayesian inference problems 
demonstrate that the corresponding reduction in the estimation variance is significant, and 
that in some cases it can be quite dramatic. Extensions of this methodology in several 
directions are given, including certain families of Metropolis-Hastings samplers and hybrid 
Metropolis-within-Gibbs algorithms. Corresponding simulation examples are presented illus- 
trating the utility of the proposed methods. All methodological and asymptotic arguments 
are rigorously justified under easily verifiable and essentially minimal conditions. 

Keywords — Bayesian inference, control variates, variance reduction, Poisson equation, Markov chain 
Monte Carlo, log-linear model, mixtures of normals, hierarchical normal linear model, threshold autore- 
gressive model. 
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1 Introduction 



Markov chain Monte Carlo (MCMC) methods provide the facility to draw, in an asymptotic 
sense, a sequence of dependent samples from a very wide class of probability measures in any 
dimension. This facility, together with the tremendous increase of computer power in recent 
years, makes MCMC perhaps the main reason for the widespread use of Bayesian statistical 
modeling and inference across the spectrum of quantitative scientific disciplines. 

This work provides a methodological foundation for the construction and use of control 
variates in conjunction with reversible MCMC samplers. Although popular in the standard 
Monte Carlo setting, control variates have received much less attention in the MCMC literature. 
The proposed methodology will be shown, both via theoretical results and simulation examples, 
to reduce the variance of the resulting estimators significantly, and sometimes quite dramatically. 

In the simplest Monte Carlo setting, when the goal is to compute the expected value of some 
function F evaluated on independent and identically distributed (i.i.d.) samples X\, X2, . . ., the 
variance of the standard ergodic averages of the F(Xi) can be reduced by exploiting available 
zero- mean statistics. If there are one or more functions U\, U2, ■ ■ ■ , Uk - the control variates 
- for which it is known that the expected value of each Uj{X\) is equal to zero, then sub- 
tracting any linear combination, 9\U\{Xi) + #2^2(Aj) + • • • + 9kUk{Xi), from the F{X{) does 
not change the asymptotic mean of the corresponding ergodic averages. Moreover, if the best 
constant coefficients {0j} are used, then the variance of the estimates is no larger than before 
and often it is much smaller. The standard practice i n thi s setting is to estimate the o ptima l 



ana oiten it is mucn smaller, ine stanaara practice m tm s setting is to estimate tne o ptima l 
based on the sam e sequ ence of samples; see, e.g., Eli] (|200lh (Robert and Casellal f|2004J V 



or iGivens and Hoetingj (|2005l ). Because of the demonstrated effectiveness of this technique, in 
many important areas of application, e.g., in computational finance whe re Monte Ca r lo me th- 
ods are a basic tool for the approximate computation of expectations, see Glasserman ( 20041 ) . a 
major research effort has been devoted to the construction of effective control variates in specific 
applied problems. 

The main difficulty in extending the above methodology to estimators based on MCMC 
samples is probably due to the intrinsic complexities presented by the Markovian structure. On 
one hand it is hard to find nontrivial, useful functions with known expectation with respect to 
the stationary distribution of the chairQ, and, even in cases where such functions are available, 
there has been no effective way to obtain consistent estimates of the corresponding optimal 
coefficients {@j}- An important underlying reason for both of these difficulties is the basic fact 
that the MCMC variance of ergodic averages is intrinsically an infinite-dimensional object: It 
cannot be expressed in closed form as a function of the transition kernel and the stationary 
distribution of the chain. 



An early reference of variance reduction for Ma rkov chain samplers is I Green and Hani (|l992l ). 

who exploit an idea of iBarone and Frigessi (119891 1 and construct antithetic variables that may 

achiev e variance reduction in simple settings but do not appear to be widely applicable. lAndradbttir et al 
(1993) focus on finite-state space chains, they observe that optimum variance reduction can be 
achieved via the solution of the associated Poisson equation (see equation ([3]) below and Sec- 
tion l2.1l for details), an d they propose numer i cal al gorithms for its solution. Raq - Black wellisation 
has been suggested by iGelfand and Smith (ll990h and by iRobert and Casellal (l2004h as a way 



1 For example. iMenger sen et aL ll 1999TI c omment that "control variates have been advertised early in the MCMC 
literature (see, e.g.. lGreen*lmd~Han (1992)), but they are difficult to work with because the models are always 
different and their complexity is such that it is extremely challenging to derive a function with known expectation." 
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to reduce the variance of MCMC estimators. Also, iPhilippe and Robert toil ) investigated the 
use of Riemann sums as a variance reduction tool in MCMC algorithms. An interesti ng as well 
as nat ural control variate that has bee n used, mainly as a converg ence diagnostic, bv lFan et al 



_rg< 

(2006), is the score statistic. Although IPhilippe and Robertl (|2001l ) mention that it can be used 
as a control variate, its practical utility has not been investigated. Atchade and Perron ( 20051 ) 



restrict attention to independen t Metropoli s samp lers and provide an explicit formula for the 
construction of control variates. iMira et al.l ( 20031 ) note that best control variate is intimately 



connected with the solutio n of the Poisson equation and they attempt to solve it numerically. 
Ha mmer and Ha kon (l2008h construct control variates for general Metropolis-Hastings sampL 



crs 



by expanding the state space. 

In most of the works cited above, the method used for the estimation of the optimal co- 
efficients {8j} is either based on the same formula as the one obtained for control variates in 
i.i.d. Monte Carlo sampling, or on the method of batch-means, but such estimators are strictly 
suboptimal and generally ineffective; see Section [TT31 for a summary an d Section 17.2 1 for d etails. 

For our purposes, a more relevant line of work is that initiated bv iHendersonl (|1997l ). who 
observed that, for any real-valued function G defined on the state space of a Markov chain {X n }, 
the function U{x) := G(x) — E\G{X n + -\ )\X„ , = x] has zero mean with respect to the stationary 
distribution of the chain. Henderson (| 19971 ). like some of the other authors mentioned above, 
also notes that the best choice for the function G would be the solution of the associated Poisson 
equation and proceeds to compute approximations of this solution for specific Markov chains, 
with particular emphasis on models arising in stochastic network theory. 

The gist of our approach is to adapt Henderson's idea and use the resulting control vari- 
ates in conjunction with a new, efficiently implementable and provably optimal estimator for 
the coefficients The ability to estimate the effectively makes these control variates 

practically relevant in the statistical MCMC context, and avoids the need to compute analytical 
approximations to the solution of the underlying Poisson equation. 



1.1 Outline of the proposed basic methodology 

Section [2 . II introduces the general setting within which all the subsequent results are developed. 
A sample of size n from an ergodic Markov chain {AT n } is used to estimate the mean E n (F) = 
J F dir of a function F, under the unique invariant measure tt of the chain. The associated 
Poisson equation is introduced, and it is shown that its solution can be used to quantify, in an 
essential way, the rate at which the chain converges to equilibrium. 

In Section [2.21 we examine the variance of the standard ergodic averages, 

and we compare it with the variance of the modified estimators, 
1 n 

- [F(Xi) - OiUi(Xi) - e 2 u 2 {x t ) o k u k (Xi)] . (2) 

i=l 

Here and throughout the subsequent discussion, the control variates Ui,U~2,--- ,U k , are con- 
structed as above via, Uj(x) := Gj(x) — PGj(x), where PG(x) denotes the one-step expectation 
E[G{X n+ i)\X n = x], for particular choices of the functions Gj, j = 1, 2, . . . , k. 
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The two central methodological issues addressed in this work are: (i) The problem of estimat- 
ing the optimal coefficient vector {0*} that minimizes the variance of the modified estimates ([2]); 
and (ii) The choice of the functions {Gj}, so that the corresponding functions {Uj} will be effec- 
tive as control variates in specific MCMC scenarios that arise from common families of Bayesian 
inference problems. 

For the first issue, in Section [3j we derive new representations for the optimal coefficient 
vector {Oj}, under the assumption that the chain {X n } is reversible; see Proposition 2. These 
representations lead to our first main result, namely, a new estimator for {Oj}', see equations ([25]) 
and (|26p . This estimator is based on the same MCMC output and it can be used after the sample 
has been obtained, making its computation independent of the MCMC algorithm used. 

The second problem, that of selecting an effective collection of functions {Gj} for the con- 
struction of the control variates {Uj}, is more complex and it is dealt with in stages. First, in 
Section 12.21 we recall that there is always a single choice of a function G that actually makes the 
estimation variance equal to zero: If G satisfies, 

U:=G-PG = F-E W (F), (3) 

then with this control variate and with = 1 the modified estimates in ([2]) are equal to the 
required expectation E- K {F) for all n. A function G satisfying ([3]) is often called a solution to 
the Poisson equation for F [or Green's function]. But solving the Poisson equation even for 
simple functions F is a highly nontrivial task, and for chains arising in ty pical applications it is, 
for all pract ical purposes, impossible; see, e.g., the relevant comments in iHenderson and 



pr 

Mevnl (120071 ). Therefore, as a first rule of thumb, we propose that a class of functions {Gj} be 
chosen such that the solution to the Poisson equation ([3]) can be accurately approximated by a 
linear combination ^ ■ OjGj of the {Gj}. For this reason we call the {Gj} basis functions. 

Clearly there are many possible choices for the basis functions {Gj}, and the effectiveness 
of the resulting control variates depends heavily on their particular choice. In order to obtain 
more specific and immediately-applicable guidelines for the selection of the {Gj}, in Section H] we 
note that there are two basic requirements for the immediate applicability of the methodology 
described so far; the underlying chain needs to be reversible in order for the estimates of the 
coefficient vector {0*} introduced in Section [3] to be consistent, and also, the one-step expec- 
tations PGj(x) := E[Gj{X n+ i)\X n = x] that are necessary for the construction of the control 
variates Uj need to be explicitly computable. 

Since the most commonly used class of MCMC algorithms satisfying both of these require- 
ments is that of conjugate random-scan Gibbs samplers]! and since the most accurate general 
approximation of the target distribution ir arising in Bayesian inference problems is a gen- 
eral multivariate Gaussian, we examine this MCMC problem in detail and obtain our second 
main result. Suppose that we wish to estimate the mean of one of the co-ordinates of a k- 
dimensional Gaussian distribution tt, based on samples Xi = (JQ , X^ , . . . ,x\ k ^) 1 generated 
by the random-scan Gibbs algorithm. In Theorem 1 we show that the solution of the associated 
Poisson equation can always be expressed as a linear combination of the k co-ordinate functions 
Gj(x) := xV', x = (x^ ,x <y2 \ . . . ,x^Y £ M. k . This is perhaps the single most interesting case 
of a Markov chain naturally arising in applications for which an explicit solution to the Poisson 
equation has ever been obtained. 



2 Following standard parlance, we call a Gibbs sampler 'conjugate' if the full conditionals of the target distribu- 
tion are all known and of standard form. This, of course, is unrelated to the notion of a conjugate prior structure 
in the underlying Bayesian formulation. 
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The above results naturally lead to the following proposal, stated in detail in Section HJ It 
is the core methodological contribution of this work. 

The basic methodology. Suppose ir is a multivariate posterior distribution for which 
MCMC samples are obtained by a reversible Markov chain {A n }. In order to estimate 
the posterior mean /iW of the ith co-ordinate x^\ let F(x) = x^\ define basis 
functions Gj as the co-ordinate functions Gj(x) = x^' for all components j for 
which PGj{x) = E[X { n % | X n = x] is explicitly computable, and form the control 
variates Uj = Gj — PGj. 

Then estimate the optimal coefficient vector 9* = {9*} by the estimator 9 = 9 n ^ 
given in ([25]), and estimate the posterior mean of interest y$ by the modified esti- 
mators given in ([27j) : 

1 n 

Mn,K(^) := - i F ( X ^ ~ & i U i( X i) ~ ^U 2 (Xi) 9 k U k (Xi)] . (4) 

i=l 

Section [5] contains three MCMC examples using this methodology. Example 1 is a brief 
illustration of the result of Theorem 1 in the case of a bivariate Gaussian. As expected, the 
modified estimators are seen to be much more effective than the standard ergodic averages (H|), 
in that their variance is smaller by a factor ranging approximately between 4 and 1000, depending 
on the sample size. Example 2 contains an analysis of a realistic Bayesian inference problem 
via MCMC, for a 66-parameter hierarchical normal linear model. There, we consider all 66 
problems of estimating the posterior means of all 66 parameters, and we find that in most cases 
the reduction in variance resulting from the use of control variates as above is typically by a 
factor ranging between 5 and 30. The third example illustrates the use of the basic methodology 
in the case of Metropolis-within-Gibbs sampling from a heavy-tailed posterior. Even though 
the posterior distribution is highly non-Gaussian, and also the one-step expectation PGj can be 
computed for only one of the two model parameters, we still find that the variance is reduced 
by a factor ranging approximately between 7 and 10. 

1.2 Extensions and applications 

Domain of applicability. The present development not only generalizes the classical method of 
control variates to the MCMC setting, but it also offers an important advantage. In the case 
of independent sampling, the control variates for each specific application need to be identified 
from scratch, often in an ad hoc fashion. In fact, for most Monte Carlo estimation problems 
there are no known functions that can be used as effective control variates. In contrast, the basic 
methodology described above provides a way of constructing a family of control variates that are 
immediately applicable to a wide range of MCMC problems, as long as the sampling algorithm 
produces a reversible chain for which the one-step expectations PG{x) := E[G(X n+ \)\X n = x] 
can be explicitly computed for some simple, linear functions G. MCMC algorithms with these 
properties form a large collection of samplers commonly used in Bayesian inference, including, 
among others: All conjugate random-scan Gibbs samplers (the main MCMC class considered in 
this work), certain versions of hybrid Metropolis-within-Gibbs algorithms, and certain types of 
Metropolis-Hastings samplers on discrete states spaces. 
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In Section Owe discuss extensions of the basic methodology along three directions that, in 
some cases, go beyond the above class of samplers. 

General classes of basis functions {Gj}. Section [6] begins with the observation that, although 
it is often effective to take the basis functions {Gj} to be the co-ordinate functions of the chain, 
in certain applications it is natural to consider different families of {Gj}. Sometimes it is the 
structure of the MCMC sampler itself that dictates a more suitable choice, while sometimes, in 
estimation problems involving more complex models - particularly in cases where it is expected 
that the posterior distribution is highly non-Gaussian - it may be possible to come up with 
more effective basis functions {Gj} by considering the form of the associated Poisson equation, 
as indicated by the first rule of thumb stated earlier. Finally, when the statistic of interest is 
not the posterior mean of one of the model parameters, there is little reason to expect that 
the co-ordinate functions {Gj} would still lead to effective control variates; instead, we argue 
that it is more appropriate to choose functions Gj leading to control variates Uj that are highly 
correlated with the statistic of interest. 

Particular instances of these three directions are illustrated via the three MCMC examples 
presented in Section [U] and summarized below. 

'N on- conjugate ' samplers. In Example 4, a non-conjugate Gibbs sampler is used for the analysis 
of the posterior of a 7-parameter log-linear model. There, the conditional expectations PGj of 
the co-ordinate functions Gj are not available in closed form, so that the basic methodology 
cannot be applied. But the nature of the sampling algorithm provides a natural and in a sense 
obvious alternative choice for the basis functions, for which the computation of the functions PGj 
needed for the construction of the associated control variates Uj = Gj — PGj is straightforward. 
With this choice of control variates, using the modified estimator in Q to estimate the posterior 
means of the seven model parameters, we find that the variance is approximately between 3.5 
and 180 times smaller than that of the standard ergodic averages (UJ. 

Complex models. Example 5 is based on a two-component G aussian mixtures model, endowed 
with the vague prior structure of lRichardson and Greenl ([1997) . To estimate the posterior means 
of the two parameters corresponding to the Gaussian locations, we use a random-scan Gibbs 
sampler (in blocks). Here, although the prior specification is conditioned on the two means 
being ordered, it is possible to generate samples without the ordering restriction, and then to 
order the samples at the post-processing stage. The resulting Markov chain has an apparently 
quite complex structure, yet we show that the basic methodology can easily be applied in this 
case. What makes this example interesting for our present purposes is that, on one hand, the 
control variates based on the co-ordinate basis functions {Gj} make little (if any) difference in 
the estimation accuracy, while choosing a different collection of basis functions {Gj} gives a very 
effective set of control variates. Specifically, appealing to the first rule of thumb stated earlier, 
we define two simple basis functions G±,G2 such that, a linear combination of G\ and G2 can 
be seen, heuristically at least, to provide a potentially accurate approximation to the solution of 
the associated Poisson equation. With these control variates, the resulting estimation variance 
is reduced by a factor ranging approximately between 15 and 55. 

General statistics of interest. The above general methodology was based on the assumption 
that the goal was to estimate the posterior mean of one of the parameters. But in many cases 
it may be a different statistic that is of interest. For example, it may be desirable to estimate 
the probability that one of the model parameters, <f> say, would lie in a particular range, e.g., 
4> G (a, b); here, the function F whose posterior mean is to be estimated would be the indicator 



5 



function of the event {eft £ (a, 6)}. In such cases, there is little reason to believe that the co- 
ordinate functions {Gj} would still lead to effective control variates. Instead, we argue that it 
is more appropriate, in accordance to the second rule of thumb derived in Section \2.2l to choose 
the Gj so that the associated control variates Uj := Gj — PGj are highly correlated with F. 

A particular such instance is illustrated in Example 6, where we examine a Bayesian model- 
selection problem arising from a two-threshold autoregressive model after all other parameters 
have been integrated out. The resulting (discrete) model space is sampled by a discrete random- 
walk Metropolis-Hastings algorithm. There, the quantity of interest is the posterior probability 
of a particular model. Defining F as the indicator function of that model, a natural choice 
for the basis functions {Gj} is to take G% = F, and G2,G^ to be the indicator functions of 
two different models that are close to the one whose posterior probability is being estimated. 
Because of the discrete nature of the sampler, the conditional expectations PGj can be easily 
evaluated, and with these basis functions we find that the variance of the modified estimators 
in (UJ) is at least 30 times smaller than that of the standard ergodic averages ([T]). 



1.3 Further extensions and results 



In Section [7] we first briefly discuss two other consistent estimators for the optimal coefficient 
vector {Oj}- One is a modified versio n of ou r earlie r estimator 9 n ^ derived in Section[3l and the 
other one was recently developed by iMevnl (|2007l ) based on the so-called "temporal difference 
learning" algorithm. Then in Section 17.21 we examine the most common estimator for the 
optimal coefficient vector {9*} that has been used in the literature, which as mentioned earlier 
is based on the method of batch-means. In Proposition 3 we show that the resulting estimator 
for tt(F) is typically strictly suboptimal, and that the amount by which its variance is larger 
than the variance of our modified estimators / u„ i K(i ? ) is potentially unbounded. Moreover, 
the batch-means estimator is computationally more expensive and generally rather ineffective, 
often severely so. This is illustrated by re-visiting the three MCMC examples of Section [5] 
and comparing the performance of the batch-means estimator with that of the simple ergodic 
averages ([1]) and of our modified estimator (i nj K.(F) in (UJ). 

Section [8] provides complete theoretical justifications of the asymptotic arguments in Sec- 
tions [21 El and El Finally we conclude with a short summary of our results and a brief discussion 
of possible further extensions in Section [9l with particular emphasis on implementation issues 
and on the difficulties of applying the present methodology to general Metropolis- Hastings 
samplers. 



We close this in t roduc tion with a few more remarks on previous related work. As mentioned 
earlier, iHendersonl (|l997n takes a different path toward optimizing the use of control variates 
for Markov chain samplers. Considering primarily continuous-time processes, an approximation 
for the solution to the associated Poisson equation is derived from the so-called "heavy traffic" 
or "fluid model" approximations of the original process. The motivation and application of 
this method is mostly related to examples fr om stochastic network theory and queueing theory. 
Closel y related approaches are presented bv Henderson and Glvnnl (2002J) and iHenderson et al. 



(120031 ). where the effectiveness of multiclass network control policies is evaluated via Markovian 
simulation. Control variates are used for variance reduction, and the optimal coefficients {Gj} 
are estimated via an adaptive, stochastic gradient algo rithm. General convergence p roperties of 
ergodic estimators using control variates are derived bv IHenderson and Simon (2004), in the case 
when the solution to the Poisson equation (either for the original chain or for an approximating 
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chain) is known explicitly. Kim and HendersonI ( 2007 ) introduce two related adaptive methods 
for tuning non-linear versions of the coefficients {#■?}, when using families of control variates 
that naturally admit a non-linear parameterization. They derive asymptotic properties for these 
estimators and present numerical simulation results. 

When the control variate U = G — PG is defin ed in t erms of a function G that can be 
taken as a Lyapunov function for the chain {X n }, iMevnl (120061) derive s precise exponential 
asymptotics for the associated modified estimators. Also, IMevnl (|2007l . Chapter 11) gives a 
development of the general control variates methodology for Markov chain data that parallels 
certain parts of our presentation in Section [2J and discusses numerous related asymptotic results 
and implementation issues. 



In a different direction, IStein et al 



(|2004l ) draw a connection between the use of control 
variates in MCMC and the "exchangeable pairs" construction used in Stein's method for distri- 
bution approximation. They consider a natural class of functions as their control variates, and 
they estimate the associated coefficients {9j} by a si mple version of the ba t ch-m eans method 
described in Section [7.21 Finally, the recent work by Delmas and Jourdain ( 20091 ) examines a 
particular case of Hender son's construction of a con trol variate in the context of Metropolis- 
Hastings sampling. Like lHammer and Hakon (120081 1. the authors expand the state space to 
include the proposals and they first take G = F and 9 = 1 (which, in part, explains why their 
WR algorithm is sometimes worse than plain Metropolis sampling). They identify the solution 
of the Poisson equation as the optimal choice for a basis function and they seek analytical ap- 
proximations. Then a general linear coefficient 9 is introduced, and for a particular version of 
the Metropolis algorithm the optimal value 9* is identified analytically. 



2 Control Variates for Markov Chains 
2.1 The setting 

Suppose {^"n} is a discrete-time Markov chain with initial state Xq = x, taking values in 
the state space X, equipped with a cr-algebra B. In typical applications, X will often be a 
(Borel measurable) subset of R rf together the collection B of all its (Borel) measurable subsets. 
[Precise definitions and detailed assumptions are given in Section 0] The distribution of {X n } 
is described by its transition kernel, P(x,dy), 

P(x,A) :=Pr{X k+1 £ A\X k =x}, x£X,AgB. (5) 

As is well-known, in many applications where it is desirable to compute the expectation 
E n (F) := vr(-F) := J F dn of some function F : X — > R with respect to some probability measure 
7r on (X, B), although the direct computation of ir(F) is impossible and we cannot even produce 
samples from tt, it is possible to construct an easy-to-simulate Markov chain {A n } which has tt 
as its unique invariant measure. Under appropriate conditions, the distribution of X n converges 
to tt, a fact which can be made precise in several ways. For example, writing PF for the function, 

PF(x) := E X {F{X X )] := £7[F(Xi) \X = x], x G X, 

then, for any initial state x, 

P n F{x) := E[F(X n ) | X = x] -> tt(F), as n -> oo, 



7 



for an appropriate class of functions F : X 
be quantified by the function, 



I. Furthermore, the rate of this convergence can 



p(x) = J2 [P n F(x) - tt(F) 

n=0 



where F is easily seen to satisfy the Poisson equation for F, namely, 

PF — F = -F + it(F). 



(6) 



(7) 



[To see this, at least formally, apply P to both sides of © and note that the resulting series for 
PF — F becomes telescoping.] 

The above results describe how the distribution of X n converges to tt. In terms of estimation, 
the quantities of interest are the ergodic averages, 



n— 1 



i=0 



Again, under appropriate conditions the ergodic theorem holds, 

Hn{F) Tt(F)> a - s -i as n — )• oo, 



(8) 



(9) 



for an appropriate class of functions F. Moreover, the rate of this convergence is quantified by 
an associated central limit theorem, which states that, 

1 n— 1 

v^K(F) - tt{F)] = -=Y\F{X i )-ir{F)\ -^N(0,o F ), as n -> oo, 

where <jp, the asymptotic variance of F, is given by, op := \iv[i n ^ 00 \ai. K {y/nii n {F)). Alterna- 
tively, it can be expressed in terms of F as, 



4 = vr F 2 



(PF) 



(10) 



The results in equations © and (fTUj) clearly indicate that it is useful to be able to compute 
the solution F to the Poisson equation for F. In general this is a highly nontrivial task, and, 
for chains arising in ty pical applications , it is imposs i ble fo r all practical purposes; see, e.g., the 
relevant comments in iHendersonl (|1997l ) and iMevnl (|2007l ). Nevertheless, the function F will 
play a central role throughout our subsequent development. 



2.2 Control variates 

Suppose that, for some Markov chain {X n } with transition kernel P and invariant measure tt, 
the ergodic averages [i n (F) as in (JSj) are used to estimate the mean vr(F) = f F dir of some 
function F under tt. In many applications, although the estimates /U n (F) converge to tt(F) as 
n — > oo, the associated asymptotic variance op is large and the convergence is very slow. 

In order to reduce the variance, we employ the idea of using control variates, as in the 
case of simple Monte Carlo with i ndependent and identicall y dist r ibute d (i.i.d.) samples; see, 
for example, the standard texts of Robert and Casella ( 2004 ). Liu ( 200ll ). Givens and Hoeting 
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(2005), or the paper by iGlvnn and Szechtmanl ([2002) for extensive discussions. Given one or 



more functions U\, U2, ■ ■ ■ , ETjj., the control variates, such that Uj : X — > R and n(Uj) = for all 
j = 1, 2, . . . , k, let 9 = (61,62, ■ ■ ■ , 6k) 1 be an arbitrary, constant vector in R fc , and define, 

k 

F :=F-(6,U) = F-Y,O j U J , (11) 

3=1 

where U : X — > IR fc denotes the column vector, U = (U\, U2, ■ ■ ■ , [Here and throughout 

the paper all vectors are column vectors unless explicitly stated otherwise, and (•, •) denotes the 
usual Euclidean inner product.] 

We consider the modified estimators, 

k 

Vn{F e ) = » n {F) ~ (6, ll n {U)) = H n (F) - £ 6j fl n (Uj), (12) 

J'=l 

for n(F). The ergodic theorem ([9]) guarantees that the estimators {fi n (Fg)} are consistent with 
probability one, and it is natural to seek particular choices for U and 6 so that the asymptotic 
variance a 2 Fe of the modified estimators is significantly smaller than the variance a 2 p of the 
standard ergodic averages [i n (F). 

Throughout this work, we wi l l conc entrate exclusively on the following class of control vari- 
ates U proposed by HendersonI ( 19971 ). For arbitrary (7r-integrable) functions Gj : X — > R 



define, 

:= — PGj, j = 1, 2, . . . , k. 

Then the invariance of ir under P and the integrability of Gj guarantee that ir(Uj) = 0. 

In the remainder of this section we derive some simple, general guidelines for choosing func- 
tions {Gj} that produce effective control variates {Uj}. This issue is revisited in more detail in 
the Bayesian MCMC context in Section HI 

Suppose, at first, that we have complete freedom in the choice of the functions {Gj}, so that 
we may take k = 1, a single U = G — PG, and 6 = 1 without loss of generality. Then the goal 
is to make the asymptotic variance of F — U = F — G + PG as small as possible. But, in view 
of the Poisson equation (JJJ), we see that the choice G = F yields, 

F-U = F-F + PF = tt{F), 

which has zero variance. Therefore, the general principle for selecting a single function G is: 

Choose a control variate U = G — PG with G ~ F. 

As mentioned above, it is typically impossible to compute F for realistic models used in appli- 
cations. But it is often possible to come up with a guess G that approximates F, or at least 
with a collection of functions {Gj} such that F can be approximated as linear combination of 
the {Gj}. Thus, our first concrete rule of thumb for choosing {Gj} states: 

Choose control variates Uj = Gj — PGj, j = 1, 2, . . . ,k with respect to a collection 
of basis functions {Gj}, such that F can be approximately expressed 
as a linear combination of the {Gj}. 
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The terminology basis functions for the {Gj} is meant to emphasize the fact that, although F 
is not known, it is expected that it can be approximately expressed in terms of the {Gj} via a 
linear expansion of the form F « £j =1 Qfij. 

Once the basis functions {Gj} are selected, we form the modified estimators fJ, n (Fg) with 
respect to the function Fq as in (fTT|) . 

Fg = F — (9,U) = F — (9, G) + (0, PG), 

where, for a vector of functions G = {G\, G2, ■ ■ • , Gk) 1 , we write PG for the corresponding 
vector, [PG\,PG%, . . . , PGf.) . The next task is to choose 6 so that the resulting variance, 

^ 2 :=4 e =vr(F|-(PF e ) 2 ), 
is minimized. Note that, from the definitions, 

Uj = Gj for each j, and Fq = F — (9, G). 
Therefore, by (fTOjl and linearity, 

a 2 e = o- 2 F -2ir(P{9,G)-PF{9,PG)')+iT({9,G) 2 -{9,PG) 2 y (13) 

To find the optimal 9* which minimizes the variance cr?, differentiating the quadratic a 2 , with 
respect to each 9j and setting the derivative equal to zero, yields, in matrix notation, 

T(G)9* = ir(FG - (PF)(PG)), 

where the k x k matrix T(G) has entries, T(G)ij = ir(GiGj — (PGi)(PGj)). Therefore, 

9* = r(G)- 1 7r{FG - (PF)(PG)), (14) 

as long as T(G) is invertible. Once again, this expression depends on F, so it is not immediately 
clear how to estimate 9* directly from the data {X n }. The issue of estimating the optimal 
coefficient vector 9* is addressed in detail in Section [3l but first let us interpret 9*. 

For simplicity, consider again the case of a single control variate U = G — PG based on a 
single function G. Then the value of 9* in f)14[) simplifies to, 

tt(FG - (PF)(PG)) tt(FG - (PF)(PG)) ^ 
n(G 2 - (PG) 2 ) 

where the second equality follows from the earlier observation that U = G. Alternatively, 
starting from the expression, a 2 , = \mi n ^ > . 00 \ai 1T (y/np in (Fo)) , simple calculations lead to, 

00 

al = a 2 F + e 2 a 2 u -29 £ Cov n (F(X ),U(X n )), (16) 

n=— 00 

so that 9* can also be expressed as, 

1 00 

r = ^T E Cov 7T (F(X ),U(X n )), (17) 

U n=—oo 



V 
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where Cov^ denotes the covariance for the stationary version of the chain, i.e., since tt(U) = 0, 
we have Cov 7T (F(X ),U(X n )) = E v [F(X )U(X n )], where X ~ tt. Then leads to the 
optimal asymptotic variance, 

1 r °° 12 
al=a 2 F -—\ Cov„(F(X ),U(X n )) . (18) 

<Jtt L 



n=— 00 



Therefore, in order to reduce the variance, it is desirable that the correlation between F and U 
be as large as possible. This leads to our second rule of thumb for selecting basis functions: 

Choose control variates U = G — PG so that each Uj is highly correlated with F. 

Incidentally, note that, comparing the expressions for 9* in (|15p and ()1T|) implies that, 

00 

Cov w (F(X ),U(X n ))=ir(FG-(PF)(PG)). (19) 



3 Estimating the Optimal Coefficient Vector 6* 

Consider, as before, the problem of estimating the mean tt(F) of a function F : X — > K based 
on samples from an ergodic Markov chain with unique invariant measure tt and transition 

kernel P. Instead of using the ergodic averages fJL n (F) as in we select a collection of basis 
functions {Gj} and form the control variates Uj = Gj — PGj, j = 1,2, ... ,k. The mean tt(F) 
is then estimated by the modified estimators fi n (Fg) as in (|12p . for a given coefficient vector 
6 e R k . 

Under the additional assumption of reversibility, in this section we introduce a consistent 
procedure for estimating the optimal coefficient vector 6* based on the same sample Then 
in Section 2] we give more detailed guidelines for choosing the {Gj}. 

Recall that, once the basis functions {Gj} have been selected, the optimal coefficient vector 
9* was expressed in flU]) as, 9* = r(G)-^(FG - (PF)(PG)), where r(G%- = ir(GiGj - 
(PGi)(PGj)), 1 < i,j < k. But, in view of equation (fT9|) derived above, the entries T(G)ij can 
also be written, 

T(G)ij := ir(GiGj - (PGi)(PGj)) (20) 

00 

= n(UiGj - (PUi)(PGj)) = CovAUi(X ),Gj(X n )). 

n=— 00 

This indicates that T(G) has the structure of a covariance matrix and, in particular, it suggests 
that T(G) should be positive semidefinite. Indeed: 

Proposition 1. Let K(G) denote the covariance matrix of the random variables 

Yj := Gj(Xt) - PGj(X ), j = 1, 2, . . . , k, 
where Xq ~ tt. Then T(G) = K(G), that is, for all 1 < i,j < k, 
TTidGj - (PGi)(PGj)) = K(G)ij := eMg^) - PGi(X )) (g s {X x ) - PGj(X jj\ . (21) 
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Proof. Expanding the right-hand side of (|2ip yields, 

ir(GiGj) - E^GiiX^PGjiXo)} - E v [G j (X 1 )PG i (X )] + ^{{PG^PGj)), 

and the result follows upon noting that the second and third terms above are both equal to the 
fourth. To see this, observe that the second term can be rewritten as, 

E^E^X^PGjiXo) \X ]} = E v [E[Gi(Xi) | X ]PG^(X )] = ^((PG^PC,)), 

and similarly for the third term. □ 
Therefore, using Proposition 1 the optimal coefficient vector 9* can also be expressed as, 

9* = K(G)- 1 it(FG- (PF)(PG)). (22) 

Now assume that the chain {X n } is reversible. Writing A = P — I for the generator of {X ra }, 
reversibility is equivalent to the statement that A is self-adjoint as a linear operator on the space 
£2(71") • I n other words, 

vr(FAG) = vr(AFG), 

for any two functions F,G £ -^2( 7r )- Our first main theoretical result is that the optimal 
coefficient vector 9* admits a representation that does not involve the solution F of the associated 
Poisson equation for F: 

Proposition 2. If the chain {Xn} is reversible, then the optimal coefficient vector 9* for the 
control variates Ui = Gi — PG{, i = 1, 2, . . . , k can be expressed as, 

9* = 9* rev = r(G)- 1 vr((F - tt(F))(G + PG)) , (23) 

or, alternatively, 

9* rev = K(G)-\{(F - n(F))(G + PG)) , (24) 

where the matrices T(G) and K(G) are defined in (j20|) and (|2ip . respectively. 

Proof. Let F = F — n(F) denote the centered version of F, and recall that F solves Poisson's 
equation for F, so PF = F — F. Therefore, using the fact that A is self-adjoint on each 
component of G, 

ir(FG - (PF)(PG)) = ir(FG - (F - F)(PG)) 

= ir(FPG - FAG) 
= tt(FPG-AFG) 
= n(FPG + FG) 
= it(F(G + PG)). 

Combining this with (|14p and (|22p . respectively, proves the two claims of the proposition. □ 
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The expression ([23]) suggests estimating 9* via, 



9 n , K = K n (G)- 1 [/i„,(F(G + PG))-/i n (F)^ n (G + PG)], (25) 
where the empirical k x k matrix K n (G) is defined by, 

1 n— 1 

(K n (G))y = — -^(G i (X t )-PG i (X i _ 1 ))(G j (X t )-PG J (Xi_ 1 )). (26) 
n t=i 

The resulting estimator [i n {Fn ) for 7r(-F) based on the vector of control variates U = G — PG 
and the estimated coefficients # nj K is defined as, 

Hu,k{F) := n n (F § ) = Hn{F) - (6 ntK ,»n(U)). (27) 

This will be the main estimator used in the remainder of the paper. 



4 The Choice of Basis Functions and the Basic Methodology 

Given an ergodic, reversible Markov chain {X n } with invariant measure tt, and a function F 
whose mean tt{F) is to be estimated based on samples from the chain, we wish to find a vector 
of control variates U = G — PG so that, for an appropriately chosen coefficient vector 9, the 
variance of the modified estimates fi n (Fg) = n n (F) — (9,U) in (jlip is significantly lower than 
that of the simple ergodic averages \x n (F) in (jSJ) . 

As noted in the Introduction, like with simple Monte Carlo estimation based on i.i.d. samples, 
there cannot possibly be a single, universal family of control variates that works well in every 
problem. Indeed, the choice of effective basis functions G = {Gj} is important for constructing 
useful control variates Uj = Gj — PGj for variance reduction. Nevertheless - and perhaps 
somewhat surprisingly - as shown next, it is possible to select basis functions {Gj} that provide 
very effective variance reduction in a wide range of MCMC estimation scenarios stemming 
from Bayesian inference problems, primarily those where inference is performed via a conjugate 
random-scan Gibbs sampler. Examples of this basic methodology are presented in Section [SJ 
Extensions and further examples are discussed in Section [6] 

Our starting point is the observation that in order to apply the results developed so far, 
the MCMC sampler at hand needs to be reversible so that the estimator 9 n ^ in ([25]) for the 
optimal coefficient vector 9* can be used, and also it is necessary that the one-step expec- 
tations PGj{x) = E[Gj{X n+ i)\X n = x] of the basis functions Gj should be explicitly com- 
putable in closed form. Probably the most natural, general family of MCMC algorithms 
that satisfy these two requirements is the collection of conjugate random-scan Gibbs sam- 
plers, with a target distribution tt arising as the posterior density of the parameters in a 
Bayesian inference s tudy. Moreover, si n ce for l arge sampl e sizes t he posterior is approx i matel y 
normal - see, e.g., Bick el and Yahavl (jl969h . iBlackwelll (|l985h . iBunfce and Milhaudl (jl998h . 



1 ^ 1 I I I | ■ I ■/ 7 ■ 1 'I ■/ 7 l — , -| 1/ 7 

Ibragimov and Has^inskij ( 198 ll ) - we focus on the case where tt is a multivariate normal 



distribution. 

According to the discussion in Section I2.2[ the main goal in choosing the basis functions 
G = {Gj} is that it should be possible to effectively approximate the solution F of the Poisson 
equation for F as a linear combination of the Gj, i.e., F ~ X*j=i ^ ^ s somewhat remarkable 
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that, in the case of random-scan Gibbs sampler with a Gaussian target density, the Poisson 
equation can be solved explicitly, and its solution is of a particularly simple form: 

Theorem 1. Let {X n } denote the Markov chain constructed from the random-scan Gibbs 
sampler used to simulate from an arbitrary (nondegenerate) multivariate normal distribution 
7r ~ N(fi, S) in M. k . If the goal is to estimate the mean of the first component of tt, then letting 
F(x) = x^' for x = (x^jX^, . . . ,x^Y £ the solution F of the Poisson equation for F 
can be expressed as linear combination of the co-ordinate basis functions Gj(x) := x^\ x£K fc , 
1 < j < k, 

k 

F = J29 3 G r (28) 



Moreover, writing Q = the coefficient vector 9 in (|28p is given by the first row of the matrix 
k{I — A)^ 1 , where A has entries Aij = —Qij/Qu, 1 < i ^ j < k, An = for all i, and (I — A) 
is always invertible. 

Proof. Let H denote the candidate solution to the Poisson equation, H{x) = ^ • 9jX^\ and 
write X = {X^\X^\ . . . ,X^) for a random vector with distribution tt ~ N(/i, £). Since tt 
is nondegenerate, S is nonsingular and so the precision matrix Q exists and its diagonal entries 
are nonzero. Since the conditional expectation of a component X^ of X given the values of the 
remaining X^ := (X^ , . . . , X^-^,X^ +1 \ X^) is ^ + £ € A je [X^ - ^% we have, 



PH{x) = J>J ^ x(J) + % + E A ^{ x(e) ~ ^ 



so that, 



PH{x)-H{x) = 



j 

= -l-9\l-A){x-ix), 

where we have used the fact that Yli Aj^(x^ — fi^*)) = -^ji{ x ~ since the diagonal 
entries of A are all zero. In order for this to be equal to —F(x) + ir(F) = — (a;' 1 ' — for all x, 
it suffices to choose 9 such that 9 t (I — A) = (k, 0, . . . , 0), as claimed. Finally, to see that (J — A) 
is nonsingular (and hence invertible), note that its determinant is equal to []X (l/Qjj)]det(Q), 
which is nonzero since S is nonsingular by assumption. □ 

In terms of MCMC estimation, the statement of Theorem 1 can be rephrased as follows: 
Suppose that samples from a multivariate Gaussian distribution are simulated via a random- 
scan Gibbs sampler in order to estimate the mean of one of its components. Then, using the 
linear basis functions Gj(x) = x^' to construct a vector of control variates U = G — PG, 
the modified estimator fJL n> fc(F) as in (j27[) should have dramatically smaller variance than the 
standard ergodic averages fj, n (F); in fact, the asymptotic variance of Hu,k{F) in the central limit 
theorem is equal to zero. This theoretical prediction is verified in a simple simulation example 
presented in Section 

Thus motivated, we now state the basic version of the variance reduction methodology 
proposed in this work. 
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Outline of the Basic Methodology 

(i) Given: 

• A multivariate posterior distribution tt(x) = Tr(x^\ x^ 2 \ . . . , x^)) 

• A reversible Markov chain {X„} with stationary distribution ir 

• A sample of length n from the chain {A„} 

(ii) Goal: 

• Estimate the posterior mean //W of 

(iii) Define: 

• F(x) = 

• Basis functions as the co-ordinate functions Gj(x) = x^ 

for all j for which PGj(x) := E[X^ \X n = x] 
is computable in closed form 

• The corresponding control variates Uj = Gj — PGj 

(iv) Estimate: 

• The optimal coefficient vector 6* by 8 n ^ as in (|25p 

• The quantity of interest /«W by the modified estimators 

fi njK (F) as in (J2ZD 



Before examining the performance of this methodology in practice we recall that the Markov 
chain in Theorem 1 is perhaps the single most complex example of a Markov chain naturally 
arising in an applied context, for which there is an explicit solution to the Poisson equation. 

Finally we mention that, even if the posterior distribution ir to be simulated is not Gaussian, 
in many cases it is possible to re-parametriz e the m ode l so that n is either Gaussian or ap prox- 
imately Gaussian; see, e.g., Hills and Smith ( 19921 ) and Tibshirani and Wasserman ( 1994 ). 



5 MCMC Examples of the Basic Methodology 

Here we present three examples of the application of the basic methodology outlined in the 
previous section. The examples below as well as those in Section [6] are chosen as representative 
cases covering a broad class of real applications of Bayesian inference. 

In Example 1, a bivariate normal density is simulated by random-scan Gibbs sampling. 
This setting is considered primarily as an illustration of the result of Theorem 1, and also as a 
simplified version of many examples in the large class of inference studies with an approximately 
normal posterior distribution. As expected, the variance of the modified estimators in this case 
is dramatically smaller. 

Example 2 considers a case of Bayesian inference via MCMC, with a large hierarchical normal 
linear model. In part due to the complexity of the model, it is perhaps natural to expect that 
the posterior is approximately Gaussian, and, indeed, the basic methodology of the previous 
section is shown to provide very significant variance reduction. 

Finally, in Example 3 the use of the basic methodology is illustrated in the case of Metropolis- 
within-Gibbs sampling from the posterior of a "difficult" model, where the use of heavy-tailed 
priors results in heavy-tailed posterior densities. Such densities are commonly met in, for exam- 
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pie, spatial statistics; see IDellaportas and Roberts! (|20Q3l ) for an illustrative example. Even in 
this case where the posterior is certainly not approximately normal, it is demonstrated that the 
basic methodology is still very effective in reducing the estimation variance. This example also 
illustrates the point that, often, not all basis function Gj can be easily used in the construction 
of control variates, as indicated in the second part of step {Hi) of the description of the basic 
methodology above. 



Example 1. The bivariate Gaussian through the random-scan Gibbs sampler. Let 

(X, Y) ~ 7r(x, y) be an arbitrary bivariate normal distribution, where, without loss of generality, 
we take the expected values of both X and Y to be zero and the variance of X to be equal to 
one. Let Var(Y) = r 2 and the covariance E{XY) = pr for some p 6 (— 1, 1). Given arbitrary 
initial values xq = x and yo = y, the random-scan Gibbs sampler selects one of the two co- 
ordinates at random, and either updates y by sampling from n{y\x) ~ N(prx, t 2 (1 — p 2 )), 
or x from ir(x\y) ~ N{^y, 1 — p 2 ). Continuing this way produces a reversible Markov chain 
{(X n ,Y n )} with distribution converging to it. To estimate the expected value of X under ir we 
set F{x, y) = x, and define the basis functions G±{x, y) = x and G 2 {x, y) = y. The corresponding 
functions PG\ and PG2 are easily computed as, 



PG 1 (x,y) = ~ 



py 

T 



and PG 2 (x,y) = ~{y + prx). 



The parameter values are chosen as, p = 0.99 and r 2 = 10, so that the two components 
are highly correlated and the sampler converges slowly, making the variance of the standard 
estimates p n {F) large. Using the samples produced by the resulting chain with initial values 
x o = yo = 0.5, we examine the performance of the modified estimator p n ^{F), and compare it 
with the performance of the standard ergodic averages p n (F). 

Figure [JJ depicts a typical realization of the sequence of estimates obtained by the two 
estimators, for n = 20000 simulation steps; the factors by which the variance of p- n {F) is larger 
than that of p n ^{F) are shown in Table [TJ In view of Theorem 1, it is not surprising that the 
estimator p n ^{F) is clearly much more effective than p n {F). 



Variance reduction factors 




Simulation steps 


Estimator 


n = 1000 


n = 10000 


n = 50000 


n = 100000 


n = 200000 


n = 500000 


Vn,K{F) 


4.13 


27.91 


122.4 


262.5 


445.0 


1196.6 



Table 1: Estimated factors by which the variance of fin{F) is larger than the variance of h„,k(F), after n = 
1000, 10000, 50000, 100000, 200000 and 500000 simulation steps. 

In this and in all subsequent examples, the reduction in the variance was computed from 
independent repetitions of the same experiment: For p n (F), F = 200 different estimates /i^ (F), 
for i = 1, 2, . . . , T, were obtained, and the variance of p n (F) was estimated by, 

1 T 

i=i 

where p n (F) is the average ofthe/^F). The same procedure was used to estimate the variance 
of Pu.k{F). 
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2000 4000 6000 8000 10000 12000 14000 16000 18000 20000 



Figure 1: The sequence of the standard ergodic averages is shown as a solid blue line and the modified estimates 
Mi,k(J 7 ') as red diamonds. For visual clarity, the values fi n ,K(F) are plotted only every 350 simulation steps. 

Example 2. A hier archical normal linear model. In an early application of MCMC in 
Bayesian statistics, Gelfand et al. (1990) illustrate the use of Gibbs sampling for inference in a 
large hierarchical normal linear model. The data consist of N = 5 weekly weight measurements 
of £ = 30 young rats, whose weight is assumed to increase linearly in time, so that, 



Yij ~ N (ai + PiXij,o-l) , 



l<i<£, 1 < j < N, 



where the Y^ are the measured weights and the Xij denote the corresponding rats' ages (in 
days). The population structure and the conjugate prior specification are assumed to be of the 
customary normal- Wishart-inverse gamma form: For i = 1,2, . . . ,£, 



2 

(J- ~ 



Pi 



iV(/i c , E c 



N (r?, C) 



W((pR)-\p) 



IG 



^0 Wo 



with known values for 77, C, uq, p, R and To- 

The posterior ir has k := 2^ + 2 + 3 + 1 



66 parameters, and MCMC samples from 



((4>i), p c ,Y, c ,Oc) ~ 7r can be generated via conjugate Gibbs sampling since the full conditional 
densities of all four paramete r blocks (</>,;), /y, S r , and al, are easily identified explicitly in terms 
of standard distributions; cf. iGelfand et al. I (ll990h . For example, conditional on (0j),E c ,cr^ and 



the observations (Yij ) , the means \i c have a bivariate normal distribution with covariance matrix 
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V := {IT,' 1 + C- 1 )- 1 and mean 



V 



(29) 



Suppose, first, that we wish to estimate the posterior mean of a c . We use a four-block, 
random-scan Gibbs sampler, which at each step selects one of the four blocks at random and 
replaces the current values of the parameter(s) in that block with a draw from the corresponding 
full conditional density. We set F((fa), fi c , S c , of) = a c , and construct control variates according 
to the basic methodology by first defining k = 66 basis functions Gj and then computing the 
one step expectations PGj. For example, numbering each Gj with the corresponding index in 
the order in which it appears above, we have Gei((fa), Mo S c , of) = a c , and from (|2"9"j) we obtain, 

PG 61 ((fa), Mc , S C! of) = (3/4)a c + (l/^Cffi" 1 + Cr 1 )" 1 [S- 1 ( £ fa) + C^r, 



Figure [2] shows a typical realization of the sequence of estimates obtained by the standard 
estimators fJ, n (F) and by the modified estimators ^ n ,K{F), for n = 50000 simulation steps. The 
variance of n n ,K(F) was found to be approximately 30 times smaller than that of ji n (F). The 
second row of Table [2] shows the estimated variance reduction factors obtained at various stages 
of the MCMC simulation, based on T = 100 repetitions of the same experiment, performed as 
in Example 1. 




10000 15000 20000 25000 30000 35000 40000 45000 50000 



Figure 2: The sequence of the standard ergodic averages is shown as a solid blue line and the modified estimates 
Hn,n{F) as red diamonds. For visual clarity, the values n n ,K{F) are plotted only every fOOO simulation steps. 
The "true" value of the posterior mean of a c , shown as a horizontal line, was computed after n = 10 7 Gibbs steps 
and taken to be w 106.6084. 



The initial values of the sampler were chosen as follows. For the (fa) we used the ordinary 
least squares estimates obtained from £ = 30 independent regressions; their sample mean and 
covariance matrix provided starting values for fj, c and S c , respectively, and a pooled variance 
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Variance reduction factors 




Simulation steps 


Parameter 


n = 1000 


n = 10000 


n = 20000 


n = 50000 


n = 100000 


n = 200000 


fa) 


1.59-3.58 


9.12-31.02 


11.73-61.08 


10.04-81.36 


12.44-85.99 


9.38-109.2 




2.99 


15.49 


32.28 


31.14 


28.82 


36.48 




3.05 


19.96 


34.05 


39.22 


32.33 


36.04 




1.15-1.38 


4.92-5.74 


5.36-7.60 


3.88-5.12 


4.91-5.34 


3.65-6.50 


°i 


2.01 


5.06 


5.23 


5.17 


4.75 


5.79 



Table 2: Estimated factors by which the variance of ^i n {Fj) is larger than the variance of (j,„ : k(F), after n 

000. 10000, 20000, 50000, 100000 and 200000 simulation steps. A different function F 3 is defined for each of the 
k = 66 scalar parameters in the vector ((</>%), He, S c , cr%), and the same vector of control variates is used for all 
of them, as specified by the basic methodology of Section [4] In the first row, instead of individual variance 
reduction factors, we state the range of variance reduction factors obtained on the 60 individual parameters (<j>i), 
and similarly for the three parameters of E c in the fourth row. 



estimate of the individual regression errors provided the initial valu e of aj. The observe d data 
(Yij) and known parameter values for rj, C, vq, p, R and to, are as in lGelfand et al. (Il990h . 
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More generally, in such a study we would be interested in the posterior mean of all of the 
k = 66 model parameters. The same experiment as above was performed simultaneously for 
all the parameters. Specifically, 66 different functions Fj, j = 1,2, ... ,66 were defined, one for 
each scalar component of the parameter vector /x c , S c , a^), and the modified estimators 

fi n ^(Fj) were used for each parameter, with respect to the same collection of control variates 
as before. The resulting variance reduction factors (again estimated from T = 100 repetitions) 
are shown in Tabled 

Example 3. Metropolis-within-Gibbs sampling from a heavy-tailed posterior. We 

consid er an inference problem motivated by a simplified version of an example in Roberts and Rosenthal! 



(|2009h . Suppose N i.i.d. observations y = (yi,yz, ■ ■ ■ ,Vn) are drawn from a N((f>, V) distribu- 



tion, and place independent priors <f> ~ Cauchy(0, 1) and V ~ IG(1, 1), on the parameters (p, V, 
respectively. The induced full conditionals of the posterior are easily seen to satisfy, 

i 

and n{V\(f>,y) ~ IG (l + y, 1 + i - Vi 



Since the distribution Tr(4>\V,y) is not of standard form, direct Gibbs s ampling is no t possible. 



Instea d, we use a random-scan Metropolis-within-Gibbs sampler, cf. iMullerl (|1993l ). iTiernev 



(119941 ). and either update V from its conditional (Gibbs step), or update ^ in a random walk- 
Metropolis step with a 4>' ~ N((p,l) proposal, each case chosen with probability 1/2. Because 
both the Cauchy and the inverse gamma distributions are heavy-tailed, we naturally expect 
that the MCMC samples will be highly variable. Indeed, this was found to be the case in the 
simulation example considered, where the above algorithm is applied to a vector y of N = 100 
i.i.d. iV(2,4) observations, and with initial values c^o = and Vq = 1. As a result of this 
variability, the standard empirical averages of the values of the two parameters also converge 
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very slowly. Since V is the more variable of the two, we let F(cf>, V) = V and consider the 
problem of estimating its posterior mean. 

We compare the performance of the standard empirical averages [J, n (F) with that of the 
modified estimators fx n ^(F). As dictated by the basic methodology, we define Gi(cfi,V) = <f> 
and G2(4>,V) = V, but we note that the one-step expectation PG\{<f>,V) cannot be obtained 
analytically because of the presence of the Metropolis step. Therefore, we use a single control 
variate U = G — PG defined in terms of the basis function G((f>, V) = V. 

Figure [3] shows a typical realization of the results of the two estimators, for n = 10000 simu- 
lation steps. The corresponding variance reduction factors, estimated from T = 100 repetitions 
of the same experiment, are 7.89, 7.48, 10.46 and 8.54, after n = 10000, 50000, 100000 and 
200000 MCMC steps, respectively. 




Figure 3: The sequence of the standard ergodic averages is shown as a solid blue line, and the modified estimates 
H*n,n{F) as red diamonds. For visual clarity, the values (J, n ,K(F) are plotted only every 200 simulation steps. The 
"true" posterior mean of V, plotted as a horizontal line, is estimated to be « 4.254 after 10 million simulation 
steps. 



6 Extensions of the Basic Methodology 

As discussed in the Introduction, unlike in the case of independent sampling where control 
variates need to be identified separately in each particular application, the basic methodology 
developed here provides a simple method for the construction of a family of control variates 
that are immediately applicable to a wide range of MCMC estimation problems. In particular, 
it applies to any Bayesian inference study where samples from the posterior are generated by a 
conjugate random-scan Gibbs sampler. 

In this section we discuss extensions of the basic methodology along three directions where, 
in each case, we indicate how a different class of basis functions {Gj} can either make the use of 
the associated control variates more effective, or it can allow their applications to more general 
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classes of MCMC samplers than those considered so far. Each of these directions is illustrated 
via a detailed MCMC example. 



6.1 'Non-conjugate' samplers 

As noted earlier, the main requirements for the applicability of the basic methodology of Section^] 
are that the MCMC sampler be reversible, and that the conditional expectations PGj{x) = 
E\G j{X n+ i)\X n = x] of (at least some of) the co-ordinate functions Gj be explicitly computable. 
Example 4 below illustrates the case of a (non-conjugate) Gibbs sampler used in the analysis of 
a log-linear model, where, although the functions PGj cannot be computed in closed form, it is 
easy to identify a different collection of basis functions {Gj} for which the values of the {PGj} 
are readily available. With find that, using the resulting control variates Uj = Gj — PGj with 
the modified estimator in Q to estimate the posterior means of the model parameters, reduces 
the estimation variance by a factor approximately between 3.5 and 180. 



Example 4. A log-linear model. We consider the 2x3x4 table presented bv lKnuiman and Speed 



( 19881 ). where 491 subjects were classified according to hypertension (yes, no), obesity (low, av- 



erage, high) and alcohol consumption (0, 1-2, 3-5, or 6+ drinks per day). We choose to estimate 
the parameters of the log-linear model with three main effects and no interactions, specified as, 

yi ~ Poisson(^), log(/ij) = zf/3, i = 1, 2 . . . , 24, 

where the yi denote the cell frequencies, modeled as Poisson variables with corresponding means 
Hi, each Zi is the ith row of the 24 x 7 design matrix z based on sum-to-zero constraints, and 
/3 = /?2, ■ • • , fiif is the parameter vector. [In Dellaportas and Forster (| 19991 ) this model was 



identified as having the highest posterior probability among all log- linear interaction models, 
under various prior specifications.] 

Assuming a flat prior on /3, standard Bayesian inference via MCMC can be performed 
either by a Gibbs sampler t hat exploits the log-concavity of full conditional densities as in 



Dellaportas and Smithl (|1993), or by a multivariate random walk Metropolis-Hastings sampler, 
with a maximum-likelihood estimate of the covariance matrix offering guidance as to the form 
of the proposal density. Instead, here we observe that the full conditional density of each f3j is 
the same as the distribution of the logarithm of a gamma random variable with density, 

Gamma (>>, :,,• >;, ::: . , exp { £^ (3 e z i£ ^ . (30) 

Hence a (reversible) random-scan Gibbs sampler can be implemented by producing samples 
according to the distributions in (j30|) and taking their logarithm to obtain samples from the 
correct full conditional density of each f3j . 

In order to estimate the posterior mean of every component of f3, we consider all seven 
estimation problems simultaneously, and set Fj{(3) = (5j for all 1 < j < 7. But, since the mean 
of the logarithm of a gamma-distributed random variable is not known explicitly, we cannot 
compute the conditional expectations PGj for the co-ordinate functions Gj (/3) = j3j dictated by 
the basic methodology. On the other hand, it is clear that the mean of exp(/3j) under the full 
conditional density of f3j is simply the mean of the gamma density in (|30p . Hence, defining basis 
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functions Gj((3) = exp(/3j) for each j = 1, 2, . . . , 7, we can simply compute, 



PGjGS) = 6 - exp(/3,) + I f ^ ) . 

Therefore, we can use the same k = 7 control variates U%, U2, ■ ■ ■ , for each Fj, where the 
Ui = Gi—PGe are defined in terms of the basis functions above, in conjunction with the modified 
estimators fi n ,K(Fj) as before. The variance reduction factors obtained by fi n ,K(Fj) after n = 
100000 simulation steps range between 57.16 and 170.34, for different parameters f3j. More 
precisely, averaging over T = 100 repetitions, the estimated variance reduction factors obtained 
by Vn,K( F j) are in the ran S e > 3.55-5.57, 38.2-57.69, 66.20-135.51, 57.16-170.34 and 
85.41-179.11, after n = 1000,10000,50000, 100000 and 200000 simulation steps, respectively. 
Figure H] shows an example of the simulated results for a sequence of the estimates ^, n {Fj) and 
/^kC-^V) f° r 07- All MCMC chains were started at the corresponding maximum likelihood 
estimates. 



0.16r 
0.14 
0.12 - 




Figure 4: Log-linear model: The sequence of the standard estimates /i n ,(J 7 V) for /J7 is shown as a solid blue line; the 
modified estimates /i n ,K(f7), plotted every 500, iterations, are plotted as red diamonds. The straight horizontal 
line represents the estimate obtained after 5 million iterations. 



6.2 Complex models 

In some estimation problems involving more complex models, especially when there is reason to 
expect that the posterior distribution is highly non-Gaussian, it may be possible to identify more 
effective basis functions {Gj} than the co-ordinate functions dictated by the basic methodology, 
by examining the form of the associated Poisson equation, as indicated by the first rule of thumb 
of Section I2.21 that is, by choosing the functions Gj so that a linear combination of the form 
@jGj approximately satisfies the Poisson equation for the target function F. One such scenario 
is illustrated in the following example, which presents an analysis of a two-component Gaussian 
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mixtures model. In order to estimate the posterior means of the two location parameters, a 
standard random-scan Gibbs sampler is employed and the samples are re-ordered at the post- 
processing stage. Despite the complex structure of the resulting chain, the basic methodology can 
still be applied, but it is found to make little difference in reducing the estimation variance. On 
the other hand, a straightforward heuristic analysis of the associated Poisson equation indicates 
that the choice of two simple functions G\ , G2 as the basis functions may lead to a fairly accurate 
approximation for the solution of the associated Poisson equation. Indeed, the resulting control 
variates are found to reduce the estimation variance by a factor ranging approximately between 
15 and 55. 



Example 5. Gaussian mixtures. Mixtures of densities provide a versatile class of sta- 
tistical models that have received a lot of attention from both a theoretical and a practical 
perspective. Mixtures primarily serve as a means of modelling heterogeneity for classification 
and discrimination, and as a way of formulating flexible models for density estimation. Although 
one of the fist major success stories in the MC MC community wa s the Bayesian implementa- 
tion o f the finite Gaussian mixtures problem (cf . iTanner and Wond (|1987l ) , iDiebolt and Robert 
(1994)), ;here are still numerous unresolved issues in inference for finite mixtures, as discussed, 



for example, in the review paper by iMarin et all (|2005h . These difficulties emanate primarily 
from the fact that such models are often ill-posed or non-identifiable. In terms of Bayesian 
inference via MCMC, these issues reflect important problems in prior specifications and label 
switching. In particular, improper priors are hard to use and proper mixing over all posterior 
modes requires enforcing label-switching moves through Metropolis steps. Detailed discussions 
of the dange r s eme r ging from prior s p ecifications a nd id entifiability constraints can be found in 



Marin et al.l (|2005l l. iLee et al.1 (120091 1 . Ijasra et al.1 (|2005l l. 



We employ a two-component Gaussian mixtures model. Starting with N = 500 data points 
V = (yi)2/2) • • • -,Vn) generated from the mixture distribution 0.7iV(0, 0.5 2 ) + 0.3iV(0.1, 3 2 ), and 
assuming that the means, variances and mixing proportions are all unknown, we consider the 
problem of estimating the two means. The usual Bayesian formulation introduces parame- 
ters (/ii, fi2, erf , a\, Z,p), as follows. The data are assumed to be i.i.d. from pN(pi, a 2 ) + (1 — 
p)N(fJL2, of); an d we place the following priors: p ~ Dirichlet(<5, 5), the two means /xi, fi2 are inde- 
pendent with each pj ~ N(£, k _1 ), and similarly the variances are indep endent with each aj 2 ~ 
Gamm afa, 0). We adopt the vague, data-dependent prior structure of iRichardson and Green 
( 19971 ); 5 is set equal to 1, £ is set equal to the empirical mean of the data y, k~ 1 / 2 is taken to 
be equal to the data range, a = 2 and = 0.02k -1 . Conditional on (/xi, /X2, a 2 ,a 2 ,p), the latent 
variables Z = (Zi, Z2, ■ ■ ■ , Zjy) are i.i.d. with Pr{Zj = 1} = 1 — Pr{Zj = 2} = p, and, given 
the entire parameter vector (/xi, p2, of, of, Z,p), the data y = (yi) are i.i.d. with each yi having 
distribution N(pj, a 2 ) if Z{ = j, for i = 1, 2, . . . , N, j = 1, 2. 

In order to estimate the location parameters (pi, P2) we sample from the posterior via a 
standard random-scan Gibbs sampler, and we also intr oduce the a prio ri restriction that fii < p2- 
In terms of the sampling itself, as noted, e.g., by IStephens (119971 1. it is preferable to first 
obtain draws from the unconstrained posterior distribution and then to impose the identifiability 
constraint at the post-processing stage. In each iteration, the random-scan Gibbs sampler selects 
one of the four parameter blocks (pLi, /X2), (of, cr 2 ), Z or p, each with probability 1/4, and draws 
a sample from the corresponding full conditional density. These densit i es are all of standard 
form and easy to sample from; see, for example, IRichardson and Green (1997). In particular, 
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the two means are conditionally independent with, 



fij ~ N — '^2 ' where = #{z : = jj, j = l,2. (31) 

Note that the data y have been generated so that the two means are very close, resulting in 
frequent label switching throughout the MCMC run and in near-identical marginal densities for 
Hi and fj,2- 

We perform a post-processing relabelling of the sampled values according to the above re- 
striction and denote the ordered sampled vector by o"i' 2 , o^' 2 , Z°,P°)- I n order to estimate 
the posterior mean of fix, let, 

F(/xi,/i 2 ,crf,cr|,Z,p) := fi° = min{^i,/i 2 }. 

Since all the one-step conditional expectations PGj of the co-ordinate basis functions Gj can 
be explicitly computed (using the full conditional densities of the sampler), the basic method- 
ology can be applied directly in this case. But repeated simulation experiments indicate that 
the gain in variance reduction is generally negligible. Specifically, if all the co-ordinate functions 
are used - one for each parameter in the vector Qui, fj,2, erf, o\, Z,p) - then h Uj k(F) includes a 
linear combination of 505 control variates, since there are N = 500 latent variables Z = (Zi). 
This introduces a large amount of additional variability in ^ n ^(F), and its variance is generally 
larger than that of p, n {F). Moreover, even if the class of basis functions is restricted to the main 
five parameters (pi, fi%, erf, cr^p), we still find that the use of control variates only produces a 
negligible advantage, if any, compared to the standard estimates p, n (F). 

In order to select a potentially more effective collection {Gj} recall that, according to the 
first rule of thumb of Section 12.21 a linear combination of the Gj would ideally satisfy, at 
least approximately, the associated Poisson equation for F, namely, PF — F = —F + tt(F) = 
— p>° + tt(F). The first important observation here is that the right-hand side of the Poisson 
equation is a function of the ordered sample, whereas all the functions Gj considered earlier are 
independent of the ordering of the two means. Therefore, a natural first candidate for a basis 
function is to take Gx (iti , \x-i , a\ , o\ , Z, p) = p,°; then PG\, the expected value of min{/ii,iX2} 
under (f3Tj) . is, 

PGi(fii,H2,vl,<?l,Z,p) = -Gi(zii,//2,crf,crf,Z",p) 



where Vj and r 2 are the means and variance s of // ,- , resp ectively, for j = 1,2, under the full 



conditional densities in (JHTJ) ; see, for example. ICainl ()1994). 



If PG\ — G\ were approximately equal to a multiple of (up to additive constant terms) , we 
could stop here. But (|32p shows that this difference contains several terms unrelated to fi®, with 
significant additional variability. So the natural next step is to select G2 so that the contribution 
PG2 — G2 may approximately cancel out the last three terms in the right-hand side of (j32[) . Since 
the nonlinear terms involving (j) and $ are hard to handle analytically and are also bounded, 
we focus on approximating the \J t( + r| factor. Note that k is typically small compared to n\ 
and ri2, so t\ can be approximated by a\j{Np) and r| by o\j(N(\ — p)). Finally, since the 
influence of a° is likely to be dominant over that of a 2 with respect to a straightforward 
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first-order Taylor expansion shows that the dominant linear term is a°, suggesting the choice 
G f 2 (Ml)M2,crf,crf,Z,p) = a\. 

To compute PG2, first note that the probability that fj,i < fj,2 under (|5Tj) . is, 



p(order) : = 



$(E( f j, 2 \---)-E(pL 1 \---)) 



+ E(a 2 \---) 



where all four expectations above are taken under the corresponding full conditional densities. 
Moreover, since the full conditional of each aj 2 is a gamma density, the expectations of o\, <J2, 
a 2 , and a^, are all available in closed form. Therefore, p(order) can be computed explicitly, and, 

PG 2 {fJ,uH2,cr 2 ,(T2,Z,p) = l-G2(ni,^2,cr 2 ,a 2 ,Z,p) 



{/ti>^ 2 } j 



E(a 2 \ ■ ■ ■)] + - p(order)<Ti + (1 - p(order))<7 2 



where, again, the expectations are taken under the corresponding full conditional densities. 

With this choice of basis functions G = (GijG^)*, we define the corresponding control 
variates U = G — PG, and compare the performance of the modified estimator fj, n ^(F) with 
that of the standard estimates [i n (F). The resulting variance reduction factors (estimated from 
T = 100 repetitions of the same experiment) are 16.17, 25.36, 38.99, 44.5 and 36.16, after 
n = 1000, 10000, 50000, 100000 and 200000 simulation steps, respectively. Figure [5] shows the 
results of a typical simulation run. The initial values of the sampler were taken after a 1000- 
iteration burn-in period, and the horizontal line in the graph depicting the "true" value of the 
posterior mean of F was obtained after 5 million Gibbs iterations. 
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Figure 5: Two-component Gaussian mixture model: The sequence of the standard ergodic averages for /ii is 
shown as a solid blue line and the modified estimates Hti,k{F), reported every 300 iterations, as red diamonds. 



25 



6.3 General statistics of interest 



In estimation problems where the quantity of interest is not the posterior mean of one of the 
parameters (so that F itself is not one of the co-ordinate functions), it is unlikely that the co- 
ordinate functions {Gj} would still lead to effective control variates. Instead, we argue that it 
is more appropriate, in accordance to the second rule of thumb of Section 12.21 to select basis 
functions Gj so that the resulting control variates Uj := Gj — PGj will be highly correlated 
with F. A particular such instance is illustrated in Example 6 below, where we examine a 
Bayesian model-selection problem arising from a two-threshold autoregressive model after all 
other parameters have been integrated out. [Such applic ations have recently found very intense 
interest, especially in the context of genetics; see, e.g., iBottolo and Richardson! (201CJ), where 



the model space is endowed with a multimodal discrete density] Here, the model space is 
sampled by a discrete random-walk Metropolis-Hastings algorithm, and the quantity of interest 
is the posterior probability of a particular model. Defining three basis functions {Gj} that are 
naturally expected to be highly correlated with F, we find that the variance of the modified 
estimators in @ is at least 30 times smaller than that of the standard ergodic averages. 

Example 6. A two-threshold autoregressive model. We revisit the monthly U.S. 
3-month treasury bill r ates, from January 1962 until December 1999, previously analyzed by 



Dellaportas et al.l (|2007l ) using flexible volatility threshold models. The time series has N = 456 
points and is denoted by r = (r t ) = (r* ; t = 1, 2, . . . , N). Here the data are modeled in terms of 
a s elf-exciting t h resho ld autoregressive model, with two regimes; it is one of the models proposed 



bv lPfann et al. I (| 19961 ). and it is defined as, 



Ar = { aw + aurt - 1 rt ~ 1 <Cl 1 + / ai€t r *~ 1 < ° 2 \ (33) 

where Art = r t — r t-i, and the parameters c%, c 2 are the thresholds where mean or volatility 
regime shifts occur. Instead, we re-write the model as, 

. / a i0 + aim-i r t -i<cx \ ( ae t r t - X < c 2 1 , . 



«20 + o.i\T t -\ r t -x > ci J \ cr(l + 7) 1/2 e t r t _i > c 2 

where 7 > — 1 ch aracterizes the jump in a 2 between the two volatility regimes. Whereas 
Pfann et al. (jl996h use a Gibbs sampler to estimate the parameters of the model in (|33p . we 



exploit the parameterization (|3"4"j) as follows. Independent improper conjugate priors are adopted 
for the variance, vr(cr 2 ) oc o --2 , and for the regression coefficients, ir(otij) cx 1, and the prior for 
each of c±, c 2 is taken to be a discrete uniform distribution over the distinct values of {rt}, except 
for the two smallest and largest values so that identifiability is obtained; the prior for 7 is chosen 
to be an exponential density with mean one, shifted to —1. 

The goal is to estimate the posterior probability 7r(c},C2|r) of the most likely model, that 
is, of the model corresponding to the pair of thresholds {c\,c\) maximizing 7r(ci,c 2 |r). In the 
above formulation, (|34p can be written equivalently as, R = Xa + e, where R = (Ar 2 , . . . , Ar^)*, 
a = (oiOj a 20) «2i) 5 e is a zero-mean Gaussian vector with covariance matrix E, and X is the 
design matrix with row t given by (1 rt_i 0), if rt-i < ci, and by (0 1 rt-i) otherwise. The 
covariance matrix of the errors, S, is diagonal with E# = a 2 if rt-i < c 2 , and = (1 + 7)<r 2 , 
otherwise. Integrating out the parameters a and a, the marginal likelihood of the data r with 
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known c\, C2 and 7 becomes, 



p(r|7,ci,c 2 )ocexp|-- |S| + log \X t YT 1 X\ + NlogiR'^R - a t X t YT 1 X&) }, 



where a = {X t X) 1 X t R is the least-squares estimate of a; see, for example. lO'Hagan and Forster 



(2004). After performing a further one-dimensional numerical integration over 7 by numer- 
ical quadrature, the marginal posterior distribution of (01,02) can be written explicitly as 
7r(ci, c 2 |r) oc p(r|cx, C2). Therefore, sampling from the posterior of the thresholds (ci, c 2 ) can be 
performed by a discrete Metropolis-Hastings algorithm on (ci,c 2 ), where the thresholds ci,C2 
take values on the lattice of all the observed values of the rates (rt) e xcept the two farth ermost at 
each end. This way, we replace the 8-dimensional Gibbs sampler of Pfann et al. (1996) for (|33p . 



by a five-dimensional analytical integration over a and a, a numerical integration over 7, and a 
Metropolis-Hastings sampler over (ci,c 2 ). 

Note that this algorithm is computationally less expensive and also more reliable, since Gibbs 
sampling across a discrete and continuous product space may encounter "sticky patches" in the 
parameter space. The discrete Metropolis-Hastings sampler used here is based on symmetric 
random walk steps, with vertical or horizontal increments of size up to ten, over the lattice of 
all possible values. In other words, the proposed pair (c4,c 2 ) given the current values (ci,c 2 ) is 
one of the forty neighboring pairs (c' l5 c 2 ) of (ci, c 2 ), where two pairs are neighbors if they differ 
in exactly one co-ordinate, and by a distance of at most ten locations. [Clearly, here we do not 
touch upon the finer issues of efficient model searching, as these would possibly require more 
sophisticated MCMC algorithms.] 

After a preliminary, exploratory simulation stage, the three a posteriori most likely pairs 
of thresholds were identified as, {c\,c\) = (13.63, 2.72), (c?,c§) = (13.89, 2.72), (cf,c|) = 
(13.78, 2.72). To estimate the actual posterior probability of the most likely model, vr(c}, c 2 |r), 
we define F(ci,e 2 ) = I{( CljC2 ) = ( c i jC i)}. For the construction of control variates, as dictated by 
the second rule of thumb of Section [2721 we select three basis functions {Gj} that would lead to 
functions Uj = Gj — PGj which are highly correlated with F. To that end, we first make what 
is perhaps the most obvious choice, taking G\ = F. Then, we define two more basis functions, 
according to the same reasoning, by taking, Gj(ci, c 2 ) = 1^^ C2 ) = ( c j c j'uj f° r J = 2, 3. 

Observe that, here, since the number of all possible random-walk moves is quite limited, the 
one-step expectations PGj(x) can be written out explicitly and their required values at specific 
points x = (ci,c 2 ) can easily be computed: Indeed, writing (ci,e 2 ) ~ (ci,c 2 ) when (ci,c 2 ) and 
(c^Cg) are neighboring pairs, PGj can be expressed, for j = 1,2,3, 



PGj{ci,c 2 ) = < 



1 ~ h S^McLca) min {l, ffig g }, if (ci,c 2 ) = (4, 4); 

m^ P MM}> if( Cl , C 2)~(ci,4); 

k 0, otherwise. 



The resulting variance reduction factors obtained by fJ> n ,K(P), estimated from T = 100 
repetitions, are 125.16, 32.83, 36.76, 30.90 and 30.11, after n = 10000,20000,50000,100000 
and 200000 simulation steps, respectively. Figure [6] shows a typical simulation run. All MCMC 
chains were started at (c},c3>). 
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Figure 6: Two-threshold autoregressive model: The sequence of the standard ergodic averages (J, n (F) is shown as 
a solid blue line and the modified estimates [in,K {F) , reported every 500 iterations, as red diamonds. The straight 
horizontal line represents the posterior model probability obtained after 50 million iterations. 

7 Further Methodological Issues 

Here we examine a number of further methodological extensions to the results presented so far, 
and we discuss a few different, related approaches. 



7.1 Alternative consistent estimators for 9* 

Recall that the estimator 9 n ^ for the optimal coefficient vector 9* = (9*, 9%, . . . , de- 
fined in Section [3] was motivated by the new representation for 9* derived in equation (|24p 
of Proposition 2. But we also derived an alternative expression for 9* in equation ()23|) . as 
9* = r(G)- 1 7 r((F - tt(F))(G + PG)), where r(G)y = ir(GiGj - (PG i )(PG i )), 1 < i,j < k. 
This suggests that 9* can alternatively be estimated via, 

9 n , r := T n (G)~ l [ix n {F{G + PG)) - /i n (FK(G + PG)], 

where the empirical k x k matrix T n (G) is defined by, (T n (G))ij = [i n (GiGj — (PGi)(PGj)), 
1 < i,j < fe. Then 9 n p can in turn be used in conjunction with the vector of control variates 
U = G — PG to estimate ir(F) via, 



Mn,r(-F) := fi n (F § 



lJL n {F)-(9 n T,Hn{U)) 



1 



n-1 



n * — ' L 

i=0 



) " (8n,T, U) 



(35) 



In theory, the estimators 9 n ^r and fi n ^r(P) enjoy the exact same asymptotic consistency and 
optimality properties as their earlier counterparts, 9 n ^ and ^ n ^{F), respectively; these are 
established in Section [8) Also, the overall empirical performance of 9 n p and fj, nt r(P) was found 
in practice to be very similar to that of 9 Uj k and fi n ,K(P)- This was observed in numerous 
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si mulation experiments we cond u cted, some of which are reported in the unpublished notes 
of iDellaportas and Kontovianni] ifcood ). A small difference between the two estimators was 



observed in experiments where the initial values of the sampler were quite far from the bulk 
of the mass of the target distribution tt. There # n> r sometimes appeared to converge faster 
than 9n t K, and the corresponding estimator fj, n ^(F) often gave somewhat better results than 
Hn,K{F). The reason for this discrepancy is the existence of a time-lag in the definition of $ri,K* 
When the initial simulation phase produces samples that approach the area near the mode of 
7r approximately monotonically, the entries of the matrix K n (Gr) accumulate a systematic one- 
sided error and consequently K n (G) takes longer to converge than T n (G). But this is a transient 
phenomenon that can be easily eliminated by including a burn-in phase in the simulation. 

One the other hand, we systematically observed that the estimator 9 Ut K was more stable than 
9 n ,r, especially so in the more complex MCMC scenarios involving a larger number of control 
variates. This difference was particularly pronounced in cases where one or more of the entries 
on the diagonal of T(G) = K(G) were near zero. There, because of the inevitable fluctuations in 
the estimation process, the values of some of the entries of 9 n ^ fluctuated wildly between large 
negative and large positive values, whereas the corresponding entries of 9 n ^ were much more 
reliable since, by definition, K(G) is positive semidefinite. 

In conclusion, we found that, among the two estimators, fi n ^(F) was consistently the more 
reliable, preferable choice. 

We also briefly menti on that a different method for consistently estimating 9* was recently 



developed in lMev n (2007), based on the "temporal difference learning" algorithm. Although this 
method also applies to non-reversible chains, it is computationally significantly more expensive 
than the estimates 9 n ^ and # n ,r 5 and its applicability is restricted to discrete-state space chains 
(or, more generally, to chains containing an atom). It may be possible to extend this idea to 
more general classes of chains by a simulated construction analogous to Nummelin's "split chain" 



technique, cf. Nummelin ( 19841 ). but we have not pursued this direction further. 



7.2 Estimating 8* via batch-means 

As noted in Section T2.21 the main difficulty in estimating the optimal coefficient vector 9* in (|14p 
was that it involves the solution F to the Poisson equation. Various authors have observed in 
the past (see the references below) that one possible way to overcome this problem is to note 
that 9* (like F itself) can alternatively be written in terms of an infinite series. Restricting 
attention, for the sake of simplicity, to the case of a single control variate U = G — PG based 
on a single function G : X — > R, from equation (fT7"|) we have, 

.. oo 1 oo 

r = -ft E E^FiXoMXj)] = _ Y, EAF(X )U(X,)]. (36) 

U j = -_oo v \ J i j = _ OQ 

This suggests the following simple strategy: Truncate the series in (j36|) to a finite sum, from 
j = —M to j = M, say, and estimate an approximation to 9* via, 



M , n-M 

1 

1 nM 



M n-M 

£ E HXi)U(X i+j ). (37) 



fiJG 2 - (PG) 2 ) ^ n-2M 

K ' ' j=-M i=M+l 
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Then, 

6n,M converges a.s. to, 



1 M 

0m:=— Yl K[F(X Q )U{X 3 )l 



(38) 



-M 



as n — > oo (see Corollary 2 in Section [8]), and one would hope that 6m ~ G* for "large enough" 
M. Using the estimated coefficient 6 nt M in conjunction with the control variate U = G — PG, 
tt(F) can then be estimated by the corresponding modified averages, 



(39) 



This methodology h a s been used exten s ively in the literature, including , a mong others, by 

Andradbttir et al.1 (|l993h . lMira et al.1 (|2003h . Istein et alJ (|2004l ). lMevnl (j2007l ) and lrlammer and Hakon 
(|2008l l. Our main point here is to show that it is strictly suboptimal, and in certain cases severely 
so. To that end, we next give a precise expression for the amount by which the asymptotic vari- 
ance of the batch- means estimators jl n ,M (F) is larger than the (theoretically minimal) variance 
<7g„ of [i n ^(F). Proposition 3 is proved at the end of this section. 

Proposition 3. The sequences of estimators {jl ny M(F)} and {/j, ni K(F)} are both asymptotically 
normal: As n — > oo, 



V 



V 



N(0,r 2 M ) 

n (0,0-1: 



where 



' denotes convergence in distribution. Moreover, the difference between the variance 



of the batch-means estimators £i n ,M(F) and that of the modified estimators ^L n ^(F) is, 



Or, 



E 

\j\>M+l 



E v [F(X )U(Xj)] 



> 0. 



(40) 



It is evident from f)40|) that the variance t m of the batch- means estimators fl n ,M(F) will 
often be significantly larger than the minimal variance Oq* achieved by ^ n ,K{F). Especially so 
if either: (i) The MCMC samples are highly correlated (as is often the case with samplers that 
tend to make small, local moves), so that the terms of the series Ylj ■F 7r [F(XQ)U(Xj)] decay 
slowly with or, (ii) \F\ tends to take on large values; e.g., note that the difference in PU|) 
can be made arbitrarily large by multiplying F by a big constant. But these are exactly the two 
most common situations that call for the use of a variance reduction technique such as control 
variates. 

Indeed, in numerous simulation experiments (some simpl e cases of which are reported in 
the unpublished notes of iDellaportas and Kontoviannisl (fcOQflh ) we observed that, compared to 
fJ"n,K(F), the batch- means estimators ft n ,M(F) require significantly more computation (especially 
for large M) and they are typically much less effective. Also, we are unaware of any reasonably 
justified (non-ad hoc) guidelines for the choice of the parameter M, which is critical for any 
potentially useful application of fL n ,M(F). 

Using the obvious extension of the above construction to the case when more than a single 
control variate is used, we re-examined the three examples of the basic methodology that were 
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presented in Section Tables 0H] and below show the results obtained by the batch- means es- 
timators fl n ,M{F) for various choices of M, together with the earlier results obtained by ^ u ,k{F). 
In each case, the sampling algorithm and all relevant parameters are as in the corresponding 
example in Section [5j In the interest of space, for Example 2 (Table H|) we only display results 
for the problem of estimating what is probably the statistically most significant parameter in 
this study, namely, the mean slope /3 C , which corresponds to taking F((cj)i), fj, c , E c , cr^) = /3 C . 



Variance reduction factors for Example 1 




Simulation steps 


Estimator 


n = 1000 


n = 10000 


n = 50000 


n = 500000 


jin,M( F )i M = 


1.01 


1.01 


1.01 


1.01 


/VmOF), m=i 


1.02 


1.02 


1.01 


1.02 


fj-n,M(F), M = 5 


1.06 


1.06 


1.06 


1.06 


pinM F )i M =10 


1.12 


1.11 


1.11 


1.11 


pin,M( F )i M = 20 


1.26 


1.23 


1.23 


1.23 


Pti,m(F), M = 100 


1.64 


2.78 


2.78 


2.74 


P"ti,m(F), M = 200 


1.88 


8.77 


7.57 


7.44 


fJ>n,K(F) 


4.13 


27.91 


122.4 


1196.6 



Table 3: Estimated factors by which the variance of /in (J 1 ) is larger than the variances of jl n ,M{F) and (J, n ,K(F). 
All estimators are applied to data generated by the random-scan Gibbs sampler for a highly correlated bivariate 
Gaussian density with parameters chosen as in Example 1. Results are shown after n — 1000, 10000, 50000 and 
500000 simulation steps, when the batch-means parameter M = 0, 1, 5, 10, 20, 100 or 200. The variance reduction 
factors are computed from T — 100 independent repetitions of the same experiment. [See the description in 
Example 1.] 



Variance reduction factors for Example 2 




Simulation steps 


Estimator 


n = 1000 


n = 10000 


n = 50000 


n = 200000 


fin,M(F), M = 


1.00 


1.00 


1.00 


1.00 


fin,M( F )' M = 1 


0.94 


1.00 


1.00 


1.00 


fin,M(F), M = 5 


0.24 


1.00 


1.00 


0.99 


»n,M(F), M =10 


0.09 


1.00 


1.00 


0.99 


jln,M(F), M = 20 


0.45 


0.99 


1.00 


0.98 


fl n ,M(F), M = 100 


10" 4 


0.84 


0.95 


0.96 


^n,K{F) 


3.05 


19.96 


39.22 


36.04 



Table 4: Estimated factors by which the variance of /in (J 1 ) is larger than the variances of jl n ,M{F) and fi n ,K(F). 
All estimators are applied to MCMC data sampled from the posterior of the hierarchical linear model described 
in Example 2, and the parameter being estimated is /3 C so that here F((<j)i), /i c , E c , a^) = /3 C . Results are shown 
after n = 1000, 10000, 50000 and 200000 simulation steps, with the batch-means parameter M = 0, 1, 5, 10, 20 or 
100. The variance reduction factors are computed from T — 100 independent repetitions of the same experiment. 
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Variance reduction factors for Example 3 




Simulation steps 


Estimator 


n = 10000 


n = 50000 


n = 100000 


n = 200000 


fl n ,M(F), M = 


1.99 


1.95 


1.99 


1.97 


Pti,m(F), M = 1 


3.39 


4.14 


3.96 


4.65 


fln,M( F )i M = 5 


3.86 


5.72 


7.69 


5.22 


/2 n , M (F), M = 10 


0.44 


3.41 


4.21 


5.99 


fi"n,M{F)i M = 20 


0.15 


0.90 


2.21 


3.18 




7.89 


7.48 


10.4 


8.54 



Table 5: Estimated factors by which the variance of fJ. n (F) is larger than the variances of p, n ,M(F) and fi n ,K(F). 
All estimators are applied to data generated by the Metropolis-within-Gibbs sampler of Example 3, simulating 
a simple heavy-tailed posterior. Results are shown after n = 10000, 50000, 100000 and 200000 simulation steps, 
for the following values of the batch-means parameter M — 0, 1, 5, 10 and 20. The variance reduction factors are 
computed from T = 100 independent repetitions of the same experiment. 



Proof of Proposition 3. The asymptotic normality statements are established in Corollar- 
ies 1 and 2 of Section [8j To simplify the notation we decompose the infinite series in (|36p into 
the sum Sm + Tm, where Sm is the sum of the terms corresponding to — M < j < M and Tm is 
the double-sided tail series corresponding to the same sum over all \j\ > M + 1. The variances 
of the two estimators are given by, 

T li = a F + "M a u ~ 29m(Sm + Tm) 
°e* = a F 2~(Sm + Tm) 2 , 

cf. (fl2|) and ([TB]). respectively. Taking the difference between the two and substituting the value 
of 6m = Sm/&u from (J3S]), gives, 

I \ 

t m ~ a e* = ~[Sm ~ 2Sm{Sm + Tm) + (Sm + Tm) ] = —T M , 
a u a u 

as claimed in (|40p . □ 

In view of Corollary 1 in Section [51 a simple examination of the above proof shows that the 
result of Proposition 3 also holds with the estimators fj, n< r(F) introduced in Section [7TT1 in place 

of Hn,K( F )- 
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8 Theory 



In this section we give precise conditions under which the asymptotics developed in Sections [3] 
and 17.21 are rigorously justified. The results together with their detailed assumptions are stated 
below and the proofs are contained in the appendix. 

First we recall the basic setting from Section [5J We take {X n } to be a Markov chain 
with values in a general measurable space X equipped with a cr-algebra B. The distribution 
of {X n } is described by its initial state Xq = x E X and its transition kernel, P(x,dy), as in 
([5]). The kernel P, as well as any of its powers P n , acts linearly on functions F : X — > M. via, 
PF{x) = E[F{X l )\X = x\. 

Our first assumption on the chain is that it is tp -irreducible and aperiodic. This means 

that there is a cr-finite measure ip on (X, B) such that, for any A E B satisfying ip(A) > and 
any initial condition x, 

P n (x,A) > 0, for all n sufficiently large. 

Without loss of generality, ip is assumed to be maximal in the sense that any other such if)' is 
absolutely continuous with respect to ip. 

Our second, and stro nger, assumption, is an essentially minimal ergodicity condition; cf. 



Mevn and Tweedid (|2009l ): We assume that there are functions V : X — > [0, oo), W : X — > [1, oo), 



a "small" set C E B, and a finite constant b > such that the Lyapunov drift condition (V3) 
holds: 

PV - V < -W + bl c . (V3) 

Recall that a set C E B is small if there exists an integer m > 1, a S > and a probability 
measure v on (X, B) such that, 

P m (x,B) > Su{B) for all x E C, B E B. 

Under (V3), we are assured that the chain is positive recurrent and that it possesses a unique 
invariant (probability) measure tt. Our final assumption on the chain is that the Lyapunov 
function V in (V3) satisfies, vr(y 2 ) < oo. 

These assumptions are summarized as follows: 



The chain is ^-irreducible and aperiodic, with unique invariant 

measure w, and there exist functions V : X — > [0, oo), W : X — > [1, oo), 
a small set C E B, and a finite constant b > 0, such that (V3) holds 
and n(V 2 ) < oo. 



(A) 



Although these condition s may seem somewhat invo lved, th eir verification is generally straight- 
forward; see the texts bv lMevn and Tweedie ll2009h and bvlRobert and Casellal d2004h. as well 



as the numerous example s developed inlRoberts and Tweedie! (1996) , Hob ert and Geverl (1998), 



Jarner and Hansen ( 200d ). Fort et al. (j2003l ). lRoberts and Rosenthall (|2004l ). It is often possible 



to avoid having to verify (V3) directly, by appealing to the property of geometric ergodicity, 
which is essentially equivalent to the requirement that (V3) holds with W being a multiple 
of the Lyapunov function V. For large classes of MCMC samplers, geometric ergodicity has 
been established in the above papers, among others. Moreover, geometrically ergodic chains, 
especially in the reversible ca se, have many attractive properties, as discussed, for example, by 
Roberts and Rosenthal ( 19981 ). 



33 



In the interest of generality, the main results of this section are stated in terms of the 
weaker (and essentially minimal) assumptions in (A). Some details on general strategies for 
their verification can be found in the references above. 

Apart from conditions on the Markov chain {X n }, the asymptotic results stated earlier 
also require some assumptions on the function F : X — > R whose mean under tt is to be 
estimated, and on the (possibly vector-valued) function G : X — > M fe which is used for the 
construction of the control variates U = G — PG. These assumptions are most conveniently 
stated within the weighted-Loo framework of iMevn and Tweedi 1 tosh . Given an arbitrary 
function W : X — > [l,oo), the weighted-Loo space L^ is the Banach space, 

{|LYx)l 
functions F : X R s.t. \\F\\ W := sup ' ; ' < oo }. 
xeX W{x) J 

With a slight abuse of notation, we say that a vector- valued function G = {G\, G%, . . . ,Gk) is 
in L^ if Gj G L^ for each j. 

Theorem 2. Suppose the chain {X n } satisfies conditions (A), and let {9 n } be o- n V sequence 
of random vectors in M. k such that 6 n converge to some constant 9 G M fc a.s., as n — >■ 00. Then: 

(i) [Ergodicity] The chain is positive Harris recurrent, it has a unique invariant (probabil- 
ity) measure tt, and it converges in distribution to tt, in that for any x 6 X and A € B, 

P n (x,A)^Tr(A), as n ->■ 00. 

In fact, there exists a finite constant B such that, 

00 

\P n F(x) - tt(F)\ < B(V(x) + 1), (41) 

n=0 

uniformly over all initial states x G X and all function F such that \F\ < W. 

(ii) [LLN] For any F,G £ and any <& £ R k , write U = G — PG and F$ := F — (0, U). 
Then the ergodic averages [i n {F), as well as the modified averages fj> n (Fg n ), both converge 
to tt(F) a.s., as n — > 00. 

(iii) [PoiSSON Equation] If F e L^, then there exists a solution F G L^ 4 " 1 to the Poisson 
equation, PF — F = —F + tt{F), and F is unique up to an additive constant. 

(iv) [CLT FOR Hn{F)] If F G L^ and the variance, a\ := tt(F 2 - (PF) 2 ) is nonzero, then 
the normalized ergodic averages -^/nf/i^L) — tt(F)] converge in distribution to N(0,ap), 
as n — > 00. 

(v) [CLT FOR Hn(Fe n )} If F,G G L™, and the variances, a 2 Fg := tt(F 2 - (PF e ) 2 ) and 
ay, := 7r(£7? — (PUj) 2 ), j = 1,2, ...,k are all nonzero, then the normalized modified 
averages Vn[fJ-n{Fe n ) — n(F)] converge in distribution to N(0,ap e ), as n — ^ 00. 

Suppose the chain {X n } satisfies conditions (A) above, and that the functions F and G = 
(Gi, G2, . . . , Gfc) 4 are in L^. Theorem 2 states that the ergodic averages ^ n (F) as well as the 
modified averages ^(Fq) based on the vector of control variates U = G — PG both converge to 
tt(F), and both are asymptotically normal. 
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Next we examine the choice of the coefficient vector 9 = 9* which minimizes the limiting 
variance a F of the modified averages, and the asymptotic behavior of the estimators 9 nj r and 
e n ,K for 9*. 

As in Section 12.21 let T(G) denote the k x k matrix with entries, T(G)ij = 7r(GjGj — 
(PGi)(PGj)), and recall that, according to Theorem 2, there exists a solution F to the Poisson 
equation for F. The simple computation outlined in Section 12.21 (and justified in the proof of 
Theorem 3) leading to equation (|14p shows that the variance ap & is minimized by the choice, 

8* = r(G)- 1 7r(FG - (PF){PG)), 

as long as the matrix r(G) is invertible. Our next result establishes the a. s. -consistency of the 
estimators, 

k,v = r n (G)- 1 [^ n (F(G + PG))- / u n (FK(G + PG)] 
§ njK = K n (G)- 1 [/in(i ? (G + PG))-/i n (F)^ n (G + PG)], 

where the empirical k x k matrices T n (G) and K n (G) are defined, respectively, by, 

(r n (G))y = flniG^-flndPGiXPGj)) 
1 n— 1 

and (K n (G))y = — - £(Gi(*t) - PG i (X t _ 1 ))(G, (X t ) - PG^X^)). 

t=i 

Theorem 3. Suppose that the chain {X n } is reversible and satisfies conditions (A). // the 
functions F,G are both in and the matrix T(G) is nonsingular, then both of the estimators 
for 9* are a.s. -consistent: 

# n> r — > 9* a.s., as n —> oo; 

#n,K - > 6* 0,-S-, as n —> oo. 

Recall the definitions of the two estimators ^i n ^{F) and fJ, n ,K(F) m equations ([35]) and ([27|) . 
from Sections [3] and mi respectively. Combining the two theorems, yields the desired asymptotic 
properties of the two estimators: 

Corollary 1. Suppose that the chain {X n } is reversible and satisfies conditions (A). // the 
functions F, G are both in and the matrix T(G) is nonsingular, then the modified estimators 
Hn,r(F) and fin iK (F) for % (F) satisfy: 

(i) [LLN] The modified estimators n nt r(F), ^ n ,K(P) both converge to tt(F) a.s., as n —> oo. 

(ii) [CLT] If a Fgf := vr(F|» — (PFg*) 2 ) is nonzero, then the normalized modified averages 
Vn[fJ"n,r(F) — n{F)] and y/n[ii n ^{F) — tt(F)] converge in distribution to N(0, a F ), as 
n — > oo, where the variance ag* is minimal among all estimators based on the control 
variate U = G — PG, in that oj* = min eeR fe o~n. 
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Finally we turn to the batch-means estimators of Section 17. 21 Recall the definitions of the 
estimators # n ,M and jU n ,Af(-^) m equations (|3"7|) and ([39|) . respectively. Our next result shows 
that # n ,M converges to 9m defined in (|55|) . and gives an a.s.-law of large numbers result and a 
corresponding central limit theorem for the estimators ft n ,M(F). Its proof follows along the same 
line as the proofs of the corresponding statements in Theorem 3 and Corollary 1. [Note that in 
the one-dimensional setting of Corollary 2 the assumption that is nonsingular reduces to 

assuming that afj = tt(G 2 — (PG) 2 ) is nonzero.] 

Corollary 2. Under the assumptions of Theorem 3, for any fixed M > 0, as n —> oo we have: 
(i) [LLN] Q UtM -> 9 M a.s., and jin,M( F ) ~^ 71 ( F ) a - s - 

(ii) [CLT] V^lfinM F ) - < F )\ ^ N (° i t m)> w here the variance t m is given by, 

oo 

r M = a 2 F + e M al-2e M Cov n (F(X ),U(X n )). (42) 

n=— oo 



Some additional re sults on the l o ng-term beha vior of estimators similar to the ones considered 
above can be found in Mevn ( 20061 ) . Mevn ( 2007 . Chapter 11), and finer asymptotics (including 
large deviations bo unds and Edgeworth expansion s ) can be derived under stronger assumptions 
from the results in iKontoviannis and Mevnl (|2003l ) . iKontoviannis and Mevnl ([2005). 



9 Concluding Remarks and Further Extensions 

Summary of results. This work introduced a general methodology for the construction and 
effective application of contro l variates to estim ation problems with MCMC data. The starting 
point was the observation by iHenderson (1997) that, for an arbitrary function G on the state 



space of a chain with transition kernel P, the function U := G — PG has zero mean with respect 
to the stationary distribution ir and can thus be used as a control variate. The two main issues 
treated here are: (i) The selection of basis functions G = {Gj} for the construction of effective 
control variates Uj = Gj — PGj ; and (ii) The problem of effectively and consistently estimating 
the optimal coefficients 6* = {&j} of the linear combination ^ ■ OjUj. The main difficulty was 
identified in Sections [2] and [3] as stemming from the fact that the obvious answer to both of these 
issues involves the solution F of an associated Poisson equation. Since F is well-known to be 
notoriously difficult to compute and only known explicitly in a handful of very simple examples, 
alternative approaches were necessarily sought. 

For reversible chains, in Section [3] we derived new representations of 9* that do not in- 
volve F, and in Section H] we derived the exact solution of the Poisson equation for a specific 
MCMC scenario. These theoretical results motivated the basic methodology for variance reduc- 
tion proposed in Section [H In Section [5] it was applied to three representative MCMC examples, 
demonstrating that the resulting reduction in the estimation variance is generally quite signifi- 
cant. Extensions in several directions were developed in Sectional in each case illustrated via a 
simulation experiment of Bayesian inference via MCMC. 

In Section[71 the most common estimator for 6* that has been used in the literature - based on 
the method of batch- means - was examined, and the resulting estimator for tt(F) was shown to 
be both computationally expensive and generally rather ineffective, often severely so. This was 
demonstrated analytically in Proposition 3 and also empirically via MCMC examples. Section [7] 
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also contains a brief discussion of two alternative approaches for estimating 6* consistently. 
Finally, all methodological and asymptotic arguments were rigorously justified in Section [8l 
under easily verifiable and essentially minimal conditions. 

Applicability. One of the strengths of our approach to the use of control variates in MCMC 
estimation is that, unlike in the classical case of independent sampling where control variates 
need to be identified in an ad hoc fashion for each specific application, the basic methodology 
developed here (and its extensions) is immediately applicable to a wide range of MCMC esti- 
mation problems. The most natural class of such problems consists of all Bayesian inference 
studies where samples from the posterior are generated by a conjugate random-scan Gibbs sam- 
pler. Recall that conjugate Gibbs sampli ng is the key ingr edient in, among others: Bayesian 
inference for dynamic l inear models, e.g., jfteis et all fcOOfih : applicat ions of slice Gibbs with 



auxili ary variables, e.g., Damien et al. ( 1999); Dirichlet process e s, e.g ., MacEachern and Mullerl 



S); and spatial regression models, elZ TGamerrnanar, et all »). 

More generally, the present methodology applies to any MCMC setting satisfying the fol- 
lowing two requirements: That the chain be reversible and that the conditional expectations 
PG(x) = E[G{X n+ i)\X n = x] are explicitly computable for some simple functions G. There is a 
large collection of samplers with these properties, including certain versions of hybrid Metropolis- 
within-Gibbs algorithms (as in Example 3), certain Metropolis-Hastings samplers on d iscrete 



states spaces (as in Example 6), and Markovian models of stochastic networks (as in iMevn 
(|2007l )). To ensure that these two requirements are satisfied, most of the experiments reported 
in Sections and [7] were performed using the random-scan version of Gibbs sampling. This 
choice is not a priori restrictive since the convergence properties of random-scan algorithms are 
generally compar able (and sometime s supe ri or) to those of systemati c-scan samplers; see, e.g., 
the discussions in Diaco nis and Ram (2000). lRoberts and Sahil (|l997h . 

We also observe that, as the present methodology is easily implemented as a post-processing 
algorithm and does not interfere in the actual sampling process, any implementation technique 
that facilitates or accelerates the MCMC convergence (such as blocking schemes, transforma- 
tions, other reversible chains, and so on), can be used, as long as reversibility is maintained. 

Further extensions. Perhaps the most interesting class of MCMC samplers that could be 
considered next is that of general Metropolis-Hastings algorithms. When the target distribution 
is discrete or, more generally, when the proposal distribution is discrete and the number of 
possible moves is not prohibitively large, then our control variates methodology can be used as 
illustrated in Example 6. But in the case of general, typically continuous or multidimensional 
proposals, there is a basic obstacle: The presence of the accept/reject probability in each step 
makes it impossible to compute the required conditional expectation PG(x) in closed form, for 
any G. If we consider the extended chain \{X„ . Y n )} that includes the values of the pr oposed 
moves Y n (as done, e.g., by Hammer and Hakon ( 20081 ) and Delmas and Jourdain ( 20091 )). then 
the computation of PG is straightforward for any function G(x, y) that only depends on x; but 
the chain {(X n ,Y n )} is no longer reversible, and there are no clear candidates for good basis 
functions G. A possibly more promising point of view is to consider the computation of PG 
an issue of numerical integration, and to try to estimate the required values PG(X n ) based on 
importance sampling or any one of the numerous standard numerical integration techniques; 
these considerations are well beyond the scope of the present work. 

In a different direction, an interesting and potentially useful point would be to examine 
the effect of the use of control variates in the estimation bias. Although the variance of the 
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standard ergodic averages fJ> n {F) is a "steady-state" object, in that it characterizes their long- 
term behavior and depends neither on the initial condition Xq = x nor on the transient behavior 
of the chain, the bias depends heavily on the initial con dition and it vanishes asympto t ically . 
Preliminary computations as in the unpublished notes of iDellaportas and Kontovianni^ (2009) 
indicate that the bias of Hn(F) decays to zero approximately like F(x)/n, and that, using on 
a single control variate U = G — PG based on a function G ~ F, can significantly reduce the 
bias. It would be interesting in future work to compute the coefficient vector 9 b which minimizes 
the bias of iJ, n {Fg) for a given collection of basis functions {Gj}, to study ways in which 9 b can 
be estimated empirically, and to examine the effects that the use of 9 b in conjunction with the 
control variates U = G — PG would have on the variance of the resulting estimator for ir(F). 

A final point which may merit further attention is the potential problem of including too 
many control variates in the modified estimator ^ n ,K{F). This issue has bee n studied extensively 
i n the c lassical context of estim a tion based on i. i.d. samples; see fo r example, |Lavcnb erg and W elch 
( 
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Law and Keltonl (jl982l ). lNelson1 (j2004l ). lGlassermanl (|2004l . pp. 200-202). ICaris and Janssend 



(2005). Since the optimal coefficient vector 9* is not known a priori, using many control variates 
may in fact increase the variance of the modified estimators [x n ^L (F) relative to [i n (F) , and care 
must be taken to ensure that the most effective subset of all available control variates is chosen. 
Common sense suggests that the values of all the estimated parameters in the vector 9 n ^ should 
be examined, and the control variates corresponding to coefficients that are approximately zero 
should be discarded. And since the MCMC output consists of simulated data from a known 
distribution, it may be possible to do this in a systematic fashion by developing a classical 
hypothesis testing procedure. 
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Appendix: Proofs of Theorems 2, 3 and Corollaries 1, 2 



Proof of Theorem 2. Since any small set is peti te, Mevn and Tweediel ( 20091 . Section 5.5.2), 
the /-norm ergodic theorem of Mevn and Tweedie ( 20091 ) implies t hat |X n | is positive recur- 
rent with a unique invariant measure tt such that flU} holds, and iMevn and Tweediel (|2009l . 



Theorem 11.3.4) proves the Harris p roperty, giving ( 
From IMevn and Tweediel (|2009l . Theorem 14.0.1 



Since F is in L%, tt(\F 



we have that, under (V3), ir(W) < oo. 

I 

Theorem 17.0.1) shows that fJ> n (F) — > tt(F) and n n (U) —> a.s. as n — > oo, and since 9 n — > 9 by 
assumption, fi n (Fg) also converges to tt(F) a.s., proving (ii). 



is finite, and since G 6 L^, Jensen's inequa lity guarantees that tt(\U\ 
is finite. The invariance of tt then implies that tt(U) = 0; therefore, Mevn and Tweediel ( 200' 



Th e existence of a solution F to the Poisson eq uation in (iii) follows from Mevn and Tweedie 
( 2009 . Theorem 17.4.2), and its un iqueness from Mevn and Tweedie ( 20091 . Theorem 17.4.1). 
The CLT in (iv) is a consequence of Mevn and Tweediel ( 20091 . Theorem 17.4.4). 



Finally, since F, G £ , the functions U and Fq are in too, so Uj and Fg exist for each 
1,2,... , k. As in (iv), the scaled averages y/n\pL n {Fo) — n(F)] and y/rifJL n {JJ j) converge in 
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distribution to N(0, ap e ) and iV~(0, cr^.), respectively, for each j, where the variances ap e and 
Ojj are as in (hi). Writing 9 = (61,62, ... , 6k) 1 and 6 n = (9 ni x,9 ni 2, • • • , #n,fc)*> we can express, 

k 

M^n(Fo n ) - 7T(F)\ = MVn(Fe) - 7T(F)\ + £ { (9 nJ - 9j)^n(Uj)}. 

Each of the terms in the second sum on the right-hand-side above converges to zero in probability, 
since and \fn^i n (Uj) converges to a normal distribution and (9 n j — 6j) — > a.s. Therefore, the 
sum converges to zero in probability, and the CLT in (v) follows from (iv). □ 

Note that the a ssumption afj 7^ in th e theorem is not necessary, since the case afj. = 
is trivial in view of Kontoviannis and Mevn ( 20031 . Proposition 2.4), which implies that, then, 



y/rifj, n (Uj) —7- in probability, as n — > 00. 

Proof of Theorem 3. We begin by justifying the computations in Sections 1231 and l3l Define 
a\ = vr(F| - (PF e ) 2 ), where F exists by Theorem 2. S ince F solves the Poisson equation for F, 
it is easy to check that Fq := F — (6, G) solves the Poisson equation for Fq. Substituting this in 
the above expression for a 2 Fg yields (fT3|) . To see that all the functions in (fT3|) are indeed integrable 
recall that F G and note that, since V is nonnegative, (V3) implies that 1 < W < V + blc, 

hence ir(W 2 ) is finite since tt(V 2 ) is finite by assumption. Therefore, since G 6 L^, F and G 
are both in L2(tt), and Holder's inequality implies that ir(F{6,G)) is finite. Finally, Jensen's 
inequality implies that PF and PG are also in £2(71"), so that ir(PF(6,PG)) < 00. And, for 
the same reasons, all the functions appearing in the computations leading to the results of 
Propositions 1 and 2 are also integrable. 

The expression for the optimal 6* in (|14p is simply the solution for the minimum of the 
quadratic in (fT3"l) . Again, note that F,G,PF and PG are all in Z^tO so 6* is well-defined. 

The consistency proofs follow from repeated applications of the ergodic theorems estab- 
lished in Theorem 2. First note that, since G € and tt(W 2 ) < 00 as remarked above, 
the product GiGj is 7r-integrable, and by Jensen 's inequality so is any pr oduct of the form 
(PGi)(PGj). Therefore, the ergodic theorem of Mevn and Tweedie ( 20091 . Theorem 17.0.1) 



implies that T n (G) — > T(G) a.s. Similarly, the functions F, G, PG, FG and FPG are all ir- 
integrable, so that the same ergodic theorem implies that # nj r indeed converges to 9* a.s., as 
n — > 00. 

To establish the corresponding result for 9 n jz, it suffices to show that K n (G) — > K{G) a.s., 
and to that end we consider the bivariate chain Y n = (X n ,X n+ i) on the state space X x X. 
Since {X n } is ^-irreducible and aperiodic, {Y n } is -irreducible and aperiodic with respect 
to the bivariate measure ip^(dx,dx') := ifj(dx)P(x,dy). Given functions W, V a small set C 
and a constant b so that (V3) holds, it is immediate that (V3) also holds for {Y n } with respect 
to the functions V^ 2 \x,x') = V(x'), W^ 2 \x,x') = W(x'), the small set X x C, and the same b. 
The unique invariant measure of {Y n } is then 7r^ 2 \dx,dx') := -n(dx)P(x,dy) 1 and ■K <y2 \{V^ 1 ) 2 ) 
is finite. Therefore, assumptions (A) hold for j Y n } and, for each pair 1 < i, j < k we can invoke 
the ergodic theorem IMevn and Tweedie! (|2009l . Theorem 17.0.1) for the ir^ -integrable function, 



H{x,x') := (Gi{x') - PGiix^Gjix') - PGj{x)), 
to obtain that, indeed, K n (G) — > K(G) a.s. □ 
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Proof of Corollary 1 . The ergodic theorems in (i) are immediate consequences of Theo- 
rem 2 (ii) combined with Theorem 3. The computation in Section 12.21 which shows that 9* in 
(fT4"|) indeed minimizes a 2 Fg (justified in the proof of Theorem 3) shows that cr|» = min eeK fc a 2 . 
Finally, the assumption that T(G) is nonsingular combined with Proposition 1, imply that all 
the variances afj, must be nonzero. Therefore, Theorem 3 combined with the central limit the- 
orems in parts (iv) and (v) of Theorem 2, prove part (ii) of the Corollary. □ 

Proof of Corollary 2. The a. s. -convergence statements in (i) follow the ergodic theorem, 
as in the proofs of Theorems 2 and 3. The a. s. -convergence of the denominator of (|37j) . /x n (G 2 — 
(PG) 2 ) — > ir(G 2 — (PG) 2 ), is a special case (corresponding to k = 1) of the a.s.-convergence 
of T n (G) to r(G) proved in Theorem 3. Considering the (2M + l)-variate chain instead of the 
bivariate chain as in the proof of Theorem 3, we can apply the ergodic theorem with the same 
integrability assumptions, to obtain that the sum in the numerator of (|37p converges a.s. to 
Y^,\j\<M Ew[F(Xo)U(Xj)], proving that Q n ,M — > 8m a.s., as n — > oo. For the modified averages, 
note that f2 n ,M(F) is simply fJ> n (F§ )• Then the LLN and CLT results for jl nt M(F) follow 
from parts (ii) and (v) of Theorem 2, respectively. Finally, the limiting variance equals cr~ , 
which, using the representation in equation (|16p . can be expressed as claimed in (|42p . □ 
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